The dual-sparse topic model: mining focused topics and focused terms in short text

    WWW, 2014.

    Cited by: 69
    Keywords:
    topic modeling, individual document, topic mixture, salient topic, real topic, … (8 more)

    Abstract:

    Topic modeling has proven to be an effective method for exploratory text mining. A common assumption of most topic models is that a document is generated from a mixture of topics. In real-world scenarios, individual documents usually concentrate on several salient topics instead of covering a wide variety of topics. A real topic a…


    Introduction
    • The authors are living in an era of information revolution in which social media is gradually taking over the role of traditional media.
    • More than 500 million tweets are posted by Twitter users on a daily basis.
    • This huge volume of user-generated content, normally in the form of very short documents, contains rich and useful information that can hardly be found in traditional information sources [31].
    • Statistical topic models have proven to be effective tools for exploratory analysis of this overload of text content [4].
    • A topic model can provide an effective organization of the latent semantics of an unstructured text collection.
    Highlights
    • We are living in an era of information revolution where social media is gradually taking over the role of traditional media.
    • A topic model can provide an effective organization of the latent semantics of an unstructured text collection.
    • Notably, Wang and Blei address only the sparsity of topic-word distributions; the "Spike and Slab" prior they introduce to topic modeling is related to the key practice of our treatment [27].
    • We address the dual sparsity of the topic representation for documents and the word representation for topics in topic modeling.
    • This problem is especially important for analyzing short text such as user-generated content on the Web.
    • We propose a novel topic model, DsparseTM, which employs a "Spike and Slab" process and introduces a smoothing prior and a weak smoothing prior for focused/unfocused topics and focused/unfocused terms (a simplified generative sketch follows this list).
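
    To make the dual-sparsity construction concrete, the following is a minimal generative sketch, not the authors' exact model or notation: Bernoulli "spike" selectors pick which topics are focused in a document and which terms are focused in a topic, and the Dirichlet concentration switches between a smoothing prior and a weak smoothing prior accordingly. All variable names and hyperparameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not the paper's settings).
K, V, D = 10, 50, 5             # topics, vocabulary size, documents
pi_topic, pi_term = 0.3, 0.2    # Bernoulli "spike" probabilities for the selectors
alpha, alpha_weak = 1.0, 1e-3   # smoothing vs. weak smoothing prior (document-topic)
beta, beta_weak = 0.1, 1e-3     # smoothing vs. weak smoothing prior (topic-word)

def dual_sparse_dirichlet(selector, strong, weak, rng):
    """Draw a distribution whose focused entries receive the smoothing prior
    and whose unfocused entries receive only the weak smoothing prior."""
    concentration = np.where(selector, strong, weak)
    return rng.dirichlet(concentration)

# Topic-word distributions: each topic concentrates on a subset of focused terms.
phi = np.empty((K, V))
for k in range(K):
    term_selector = rng.random(V) < pi_term          # spike: which terms are focused
    phi[k] = dual_sparse_dirichlet(term_selector, beta, beta_weak, rng)

# Document-topic mixtures: each short document concentrates on a few focused topics.
documents = []
for _ in range(D):
    topic_selector = rng.random(K) < pi_topic        # spike: which topics are focused
    if not topic_selector.any():                     # ensure at least one focused topic
        topic_selector[rng.integers(K)] = True
    theta = dual_sparse_dirichlet(topic_selector, alpha, alpha_weak, rng)
    length = int(rng.integers(8, 15))                # short-text document length
    words = [int(rng.choice(V, p=phi[rng.choice(K, p=theta)])) for _ in range(length)]
    documents.append(words)

print("example short document (word ids):", documents[0])
```

    In this sketch the weak smoothing prior keeps unfocused entries close to, but not exactly, zero, which is the intuition behind decoupling sparsity from smoothing; the paper's actual parameterization and inference procedure should be taken from the original text.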
    Results
    • 5.4.1 Topic coherence

      The PMI scores of all candidate methods are presented in Table 3 (a minimal sketch of the PMI computation follows these bullets).
    • (Figure: topic coherence vs. number of topics on DBLP, 20NG, Twitter, and Twitter-A, comparing DsparseTM, LDA, STC, and conference-based topics.)
    • The proposed DsparseTM model yields the highest PMI score, followed by LDA and the mixture of unigrams, all of which outperform STC by a large margin.
    • The average PMI score of the 22 conference-based topics is 0.586.
    • It is interesting to see that both DsparseTM and LDA achieve higher PMI scores than the conference-based topics.
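
    The coherence measure referenced above is the standard PMI-based score: for each topic, pointwise mutual information is averaged over pairs of the topic's top terms, with probabilities estimated from co-occurrence in a reference corpus (cf. Newman et al. [20]). The snippet below is a minimal sketch of that computation; the toy corpus and the choice of document-level co-occurrence are illustrative assumptions, not the evaluation setup used in the paper.

```python
import numpy as np
from itertools import combinations

def topic_pmi(top_words, reference_docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words, with probabilities
    estimated from document-level co-occurrence in a reference corpus."""
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of reference documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        scores.append(np.log((p12 + eps) / (p1 * p2 + eps)))
    return float(np.mean(scores))

# Toy usage with a made-up reference corpus and topic words (illustrative only).
corpus = [["topic", "model", "sparse", "text"],
          ["sparse", "prior", "topic"],
          ["word", "distribution", "model", "topic"]]
print(round(topic_pmi(["topic", "model", "sparse"], corpus), 3))
```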
    Conclusion
    • The authors address the dual sparsity of the topic representation for documents and the word representation for topics in topic modeling.
    • This problem is especially important for analyzing short text such as user-generated content on the Web.
    • The authors propose a novel topic model, DsparseTM, which employs a "Spike and Slab" process and introduces a smoothing prior and a weak smoothing prior for focused/unfocused topics and focused/unfocused terms.
    Tables
    • Table 1: Variables and notations
    • Table 2: Statistics of the data sets
    • Table 3: Topic coherence (PMI) on four data sets
    • Table 4: Focused topics and average sparsity ratio for different DBLP categories
    • Table 5: Focused topics and average sparsity ratio for different 20NG categories (comp.graphics, comp.os.ms-windows.misc, soc.religion.christian, talk.politics.mideast)
    • Table 6: Focused terms and sparsity ratio of selected topics on DBLP
    • Table 7: Focused terms and sparsity ratio of selected topics on 20NG
    • Table 8: Most frequent terms of a category, and most relevant topics on DBLP
    • Table 9: Most frequent terms of a category, and most relevant topics on 20NG
    Related work
    • To the best of our knowledge, this is the first study that simultaneously mines focused topics and focused terms from short text. This is achieved by addressing the dual sparsity of topic mixtures and topic-word distributions in topic modeling. Our work is related to the following lines of literature.

      2.1 Classical Probabilistic Topic Models

      Classical probabilistic topic models such as probabilistic latent semantic analysis (PLSA) [13] and latent Dirichlet allocation (LDA) [6] have been widely adopted in text mining. Without utilizing auxiliary information such as higher-level context, these classical topic models generally regard each document as an admixture of topics, where each topic is defined as a unigram distribution over all the terms in the vocabulary. In practice, the benefit of LDA over PLSA comes from the smoothing of the document-topic and topic-word distributions introduced by the Dirichlet priors, which alleviates the overfitting problem of PLSA. To relax the assumption that the user knows the number of topics a priori, Blei et al. proposed a hierarchical topic model that utilizes the Chinese restaurant process to construct an infinite number of topics [5]. These classical models generally lack the ability to directly control the posterior sparsity [11] of the inferred representations and thus fail to address the skewness of the topic mixtures and the word distributions. One could, of course, enhance sparsity by letting the Dirichlet priors in LDA approach zero; however, such a crude treatment inevitably weakens the effect of smoothing. Previous work has shown that simply applying a small Dirichlet prior is not only ineffective in controlling the posterior sparsity [32], but also results in compromised, less smooth document-topic and topic-word distributions [27]. In other words, a weakened Dirichlet smoothing yields sparsity only because of the scarcity of information. As a result, when applied to short documents, classical topic models usually cannot perform as well as they do on professionally written documents, even when the Dirichlet priors are optimized.
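
      The trade-off described above can be seen directly by sampling from a symmetric Dirichlet: shrinking the concentration parameter does make the draws look sparse, but only because almost all smoothing mass is removed. The short experiment below is an illustrative sketch, not from the paper; the dimension, threshold, and sample counts are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_effective_support(alpha, dim=100, draws=2000, threshold=1e-3):
    """Average number of components carrying non-negligible probability mass
    in draws from a symmetric Dirichlet(alpha)."""
    samples = rng.dirichlet(np.full(dim, alpha), size=draws)
    return (samples > threshold).sum(axis=1).mean()

# Smaller concentrations yield sparser-looking draws, but the suppressed
# components receive essentially no smoothing mass, so the sparsity comes
# from scarcity of information rather than from modeling the skewness.
for alpha in (1.0, 0.1, 0.01, 0.001):
    print(f"alpha = {alpha:<6} -> components with mass > 1e-3: "
          f"{avg_effective_support(alpha):5.1f} of 100")
```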
    Funding
    • This work is supported by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) Project Nos. CUHK 411211 and 411310, the Chinese University of Hong Kong Direct Grant No. 4055015, and the National Science Foundation under grant numbers IIS-0968489, IIS-1054199, and CCF-1048168.
    Reference
    • [1] C. Archambeau, B. Lakshminarayanan, and G. Bouchard. Latent IBP compound Dirichlet allocation. In NIPS Bayesian Nonparametrics Workshop, 2011.
    • [2] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In UAI, pages 27–34, 2009.
    • [3] Y. Bengio, A. C. Courville, and J. S. Bergstra. Unsupervised models of images by spike-and-slab RBMs. In ICML, pages 1145–1152, 2011.
    • [4] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
    • [5] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, pages 106–114, 2003.
    • [6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
    • [7] J. Chang, J. L. Boyd-Graber, S. Gerrish, C. Wang, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In NIPS, pages 288–296, 2009.
    • [8] X. Chen, M. Zhou, and L. Carin. The contextual focused topic model. In KDD, pages 96–104, 2012.
    • [9] A. C. Courville, J. Bergstra, and Y. Bengio. A spike and slab restricted Boltzmann machine. In AISTATS, pages 233–241, 2011.
    • [10] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2002.
    • [11] J. V. Graca, K. Ganchev, B. Taskar, and F. Pereira. Posterior vs. parameter sparsity in latent variable models. In NIPS, pages 664–672, 2009.
    • [12] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101:5228–5235, 2004.
    • [13] T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289–296, 1999.
    • [14] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. JMLR, 5:1457–1469, 2004.
    • [15] H. Ishwaran and J. S. Rao. Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.
    • [16] A. Kaban, E. Bingham, and T. Hirsimaki. Learning to read between the lines: The aspect Bernoulli model. In SDM, pages 462–466, 2004.
    • [17] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
    • [18] Y. Lu, Q. Mei, and C. Zhai. Investigating task performance of probabilistic topic models: An empirical study of PLSA and LDA. Information Retrieval, 14(2):178–203, 2011.
    • [19] R. Mehrotra, S. Sanner, W. Buntine, and L. Xie. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In SIGIR, pages 889–892, 2013.
    • [20] D. Newman, J. H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In NAACL, pages 100–108, 2010.
    • [21] I. Sato and H. Nakagawa. Rethinking collapsed variational Bayes inference for LDA. In ICML, 2012.
    • [22] E. Saund. A multiply cause mixture model for unsupervised learning. Neural Computation, 7(1):51–71, 1995.
    • [23] M. Shashanka, B. Raj, and P. Smaragdis. Sparse overcomplete latent variable decomposition of counts data. In NIPS, pages 1313–1320, 2007.
    • [24] J. Tang, M. Zhang, and Q. Mei. One theme in all views: Modeling consensus topics in multiple contexts. In KDD, pages 5–13, 2013.
    • [25] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, pages 1353–1360, 2006.
    • [26] H. M. Wallach, D. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In NIPS, pages 1973–1981, 2009.
    • [27] C. Wang and D. M. Blei. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS, pages 1982–1989, 2009.
    • [28] Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. In SIGIR, pages 685–694, 2011.
    • [29] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei. Focused topic models. In NIPS Workshop on Applications for Topic Models: Text and Beyond, 2009.
    • [30] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, pages 1151–1158, 2010.
    • [31] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. In ECIR, pages 338–349, 2011.
    • [32] J. Zhu and E. P. Xing. Sparse topical coding. In UAI, pages 831–838, 2011.