Gaussian LDA for Topic Models with Word Embeddings

    Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2015.

    Cited by: 191
    Keywords:
    natural language processing, word vector, Metropolis Hastings, semantic coherence, Gaussian LDA

    Abstract:

    Continuous space word embeddings learned from large, unstructured corpora have been shown to be effective at capturing semantic regularities in language. In this paper we replace LDA’s parameterization of “topics” as categorical distributions over opaque word types with multivariate Gaussian distributions on the embedding space. This encourages the model to group words that are a priori known to be semantically related into topics.

    Introduction
    • Latent Dirichlet Allocation (LDA) is a Bayesian technique that is widely used for inferring the topic structure in corpora of documents
    • It conceives of a document as a mixture of a small number of topics, and of topics as distributions over word types (Blei et al., 2003).
    • According to the distributional hypothesis (Harris, 1954), words occurring in similar contexts tend to have similar meaning
    • This has given rise to data-driven learning of word vectors that capture lexical and semantic properties, which is a technique of central importance in natural language processing.
    • A word is used as input to a log-linear classifier with a continuous projection layer, and the words within a certain window before and after the input word are predicted (see the sketch below)
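    The following is a minimal, self-contained sketch of this window-prediction setup (a skip-gram-style objective). The toy corpus, vocabulary, embedding dimension, and window size are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

# Illustrative sketch of the window-prediction idea described above.
rng = np.random.default_rng(0)
corpus = ["the cat sat on the mat".split(), "dogs chase the cat".split()]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
V, D, window = len(vocab), 8, 2

W_in = rng.normal(scale=0.1, size=(V, D))   # continuous projection layer
W_out = rng.normal(scale=0.1, size=(V, D))  # output (context) vectors

def context_pairs(sent):
    """Yield (input word, context word) index pairs within the window."""
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                yield idx[w], idx[sent[j]]

def log_prob(center, context):
    """Log-linear model: softmax over the vocabulary of dot-product scores."""
    scores = W_out @ W_in[center]
    return scores[context] - np.log(np.exp(scores).sum())

# total log-likelihood of predicting each context word from its center word
print(sum(log_prob(c, o) for s in corpus for c, o in context_pairs(s)))
```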
    Highlights
    • Latent Dirichlet Allocation (LDA) is a Bayesian technique that is widely used for inferring the topic structure in corpora of documents
    • Standard human evaluations of topic modeling performance are designed to elicit assessments of semantic coherence (Chang et al., 2009; Newman et al., 2009), but this preference for semantic coherence is not encoded in the model, and any semantic coherence found in the inferred topic distributions is, in some sense, accidental
    • We develop a variant of Latent Dirichlet Allocation that operates on continuous space embeddings of words— rather than word types—to impose a prior expectation for semantic coherence
    • Our approach replaces the opaque word types usually modeled in Latent Dirichlet Allocation with continuous space embeddings of these words, which are generated as draws from a multivariate Gaussian (a sketch of this generative process appears at the end of this list)
    • DPMM models of word emissions would better model the fact that identical vectors will be generated multiple times, and perhaps add flexibility to the topic distributions that can be captured, without sacrificing efficient inference
    • More broadly still, running Latent Dirichlet Allocation on documents consisting of different modalities than just text is facilitated by using the lingua franca of vector space representations, so we expect numerous interesting applications in this area
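    As a rough illustration of the generative story sketched above, the snippet below draws topic means and covariances from a Normal-Inverse-Wishart prior and then generates word embeddings for a document. All hyperparameter values (K, D, alpha, and the NIW parameters) are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
K, D, alpha = 5, 50, 0.1             # topics, embedding dimension, Dirichlet prior
mu0, kappa0, nu0 = np.zeros(D), 0.1, D + 2
Psi0 = np.eye(D)

topics = []
for _ in range(K):                   # each topic is a Gaussian in embedding space
    Sigma = invwishart.rvs(df=nu0, scale=Psi0)
    mu = rng.multivariate_normal(mu0, Sigma / kappa0)
    topics.append((mu, Sigma))

def generate_document(n_words):
    theta = rng.dirichlet(alpha * np.ones(K))    # document-topic proportions
    z = rng.choice(K, size=n_words, p=theta)     # per-word topic assignments
    # each "word" is observed as its embedding, drawn from the topic Gaussian
    return np.stack([rng.multivariate_normal(*topics[k]) for k in z]), z

embeddings, assignments = generate_document(20)
print(embeddings.shape, np.bincount(assignments, minlength=K))
```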
    Methods
    • The authors evaluate the Word Vector Topic Model on various experimental tasks.
    • Quantitative analysis: typically, topic models are evaluated based on the likelihood of held-out documents.
    • In this case, because the model assigns density to continuous word embeddings rather than probability mass to word types, its perplexities are not directly comparable with those of models which do topic modeling on words.
    • A higher PMI score implies a more coherent topic, as it means the topic's top words tend to co-occur in the same documents (a minimal sketch of this score follows this list).
    • In the last row of Table 1, the authors present the PMI score for some of the topics from both Gaussian LDA and traditional multinomial LDA
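    A minimal sketch of the PMI-based coherence idea described above, computed from document-level co-occurrence on a toy corpus. The paper computes PMI with respect to Wikipedia co-occurrence; the corpus, topic words, and smoothing constant here are assumptions for illustration:

```python
import numpy as np
from itertools import combinations

def topic_pmi(top_words, docs, eps=1e-12):
    """Average pairwise PMI of a topic's top words over a reference corpus."""
    docs = [set(d) for d in docs]
    N = len(docs)
    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in docs) / N
    scores = []
    for w1, w2 in combinations(top_words, 2):
        joint, p1, p2 = p(w1, w2), p(w1), p(w2)
        if joint > 0:
            scores.append(np.log(joint / (p1 * p2 + eps)))
    return float(np.mean(scores)) if scores else 0.0

docs = [["space", "nasa", "launch"], ["space", "orbit", "launch"], ["dog", "cat"]]
print(topic_pmi(["space", "launch", "orbit"], docs))
```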
    Results
    • As mentioned before, traditional topic modeling algorithms cannot handle out-of-vocabulary (OOV) words.
    • Zhai and Boyd-Graber (2013) proposed an extension of LDA (infvoc) which can incorporate new words.
    • They showed better performance than other fixed-vocabulary algorithms on a document classification task on the 20-newsgroups dataset, using the topic distribution of a document as features.
    • If the document topic distribution is modeled well, the model should do a better job on this classification task (a sketch of this evaluation setup follows this list)
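    The snippet below sketches this kind of evaluation: per-document topic proportions serve as features for a classifier, and held-out accuracy is the measure. The synthetic topic proportions, label generation, and use of scikit-learn's LogisticRegression are assumptions for illustration, not the authors' experimental pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
K, n_docs, n_classes = 50, 1000, 20          # e.g. 20 newsgroup labels
labels = rng.integers(n_classes, size=n_docs)

# synthetic stand-in for inferred theta: each class mildly prefers one topic,
# so the topic-proportion features carry some signal about the label
alpha = np.ones((n_classes, K))
alpha[np.arange(n_classes), np.arange(n_classes)] += 5.0
doc_topic = np.vstack([rng.dirichlet(alpha[y]) for y in labels])

X_tr, X_te, y_tr, y_te = train_test_split(doc_topic, labels, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```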
    Conclusion
    • While word embeddings have been incorporated to produce state-of-the-art results in numerous supervised natural language processing tasks, from the word level to the document level, they have played a more minor role in unsupervised learning problems.
    • More broadly still, running LDA on documents consisting of different modalities than just text is facilitated by using the lingua franca of vector space representations, so the authors expect numerous interesting applications in this area.
    • An interesting extension to this work would be the ability to handle polysemous words based on multi-prototype vector space models (Neelakantan et al., 2014; Reisinger and Mooney, 2010); the authors leave this as an avenue for future research
    Tables
    • Table 1: Top words of some topics from Gaussian LDA and multinomial LDA on 20-newsgroups for K = 50. Words in Gaussian LDA are ranked by the density assigned to them by the posterior predictive distribution. The last row for each method gives the PMI score (w.r.t. Wikipedia co-occurrence) of the topic's fifteen highest-ranked words
    • Table 2: Accuracy of our model and infvoc on the synthetic datasets. In Gaussian LDA fix, the topic distributions learned during training were held fixed; GLDA (1, 100, 1932) is the online implementation of our model where the documents come in minibatches, and the numbers in parentheses denote the minibatch size. The full size of the test corpus is 1932
    • Table 3: Average L1, L2, and L∞ deviation between the topic distribution inferred on the actual document and on the corresponding synthetic document, on the NIPS corpus. Compared to infvoc, G-LDA achieves a lower deviation of the topic distribution inferred on the synthetic documents with respect to the actual document (see the sketch below). The full size of the test corpus is 174
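    For concreteness, the deviation measures reported in Table 3 can be computed as below; the two topic-distribution vectors are toy values, not numbers from the paper:

```python
import numpy as np

# topic distributions inferred on an actual document and its synthetic counterpart
theta_actual = np.array([0.5, 0.3, 0.2])
theta_synth = np.array([0.4, 0.35, 0.25])

diff = theta_actual - theta_synth
print("L1:", np.abs(diff).sum(),
      "L2:", np.linalg.norm(diff),
      "Linf:", np.abs(diff).max())
```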
    Contributions
    • Introduces a fast collapsed Gibbs sampling algorithm based on Cholesky decompositions of the covariance matrices of the posterior predictive distributions (a sketch of the underlying rank-one Cholesky update appears after this list)
    • Develops a variant of LDA that operates on continuous space embeddings of words— rather than word types—to impose a prior expectation for semantic coherence
    • Proposes a new technique for topic modeling by treating the document as a collection of word embeddings and topics itself as multivariate Gaussian distributions in the embedding space
    • Explores several strategies for collapsed Gibbs sampling and derives scalable algorithms, achieving asymptotic speed-up over the naïve implementation
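    The Cholesky-based speed-up comes from updating the factor of a topic's covariance-related matrix by a rank-one term when a word embedding enters or leaves the topic, rather than refactorizing from scratch. Below is a standard rank-one Cholesky update in numpy, offered as an illustrative sketch rather than the authors' implementation:

```python
import numpy as np

def chol_rank1_update(L, x):
    """Given lower-triangular L with A = L @ L.T, return the Cholesky
    factor of A + x x^T in O(d^2) instead of a full O(d^3) refactorization."""
    L, x = L.copy(), x.copy()
    d = x.size
    for k in range(d):
        r = np.hypot(L[k, k], x[k])
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < d:
            L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

# quick check on a random positive-definite matrix
rng = np.random.default_rng(0)
d = 6
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)
x = rng.normal(size=d)
L_new = chol_rank1_update(np.linalg.cholesky(A), x)
assert np.allclose(L_new @ L_new.T, A + np.outer(x, x))
```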
    References
    • Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of NAACL.
    • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March.
    • Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems.
    • Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML.
    • S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science.
    • Chris Dyer, Adam Lopez, Juri Ganitkevitch, Johnathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models. In Proceedings of ACL.
    • Mark Finlayson. 2014. Java libraries for accessing the Princeton WordNet: Comparison and evaluation. In Proceedings of the Seventh Global WordNet Conference, pages 78–85.
    • Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of NAACL-HLT, pages 758–764, Atlanta, Georgia, June. Association for Computational Linguistics.
    • T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, April.
    • Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Revisiting embedding features for simple semi-supervised learning. In Proceedings of EMNLP.
    • Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18, November.
    • Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.
    • Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. arXiv preprint arXiv:1404.4641.
    • Pengfei Hu, Wenju Liu, Wei Jiang, and Zhanlei Yang. 2012. Latent topic model based on Gaussian-LDA for audio retrieval. In Pattern Recognition, volume 321 of CCIS, pages 556–563. Springer.
    • Aaron Q. Li, Amr Ahmed, Sujith Ravi, and Alexander J. Smola. 2014. Reducing the sampling complexity of topic models. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14.
    • Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, pages 746–751, Atlanta, Georgia, June. Association for Computational Linguistics.
    • George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, November.
    • Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
    • Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient nonparametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP 2014, Doha, Qatar.
    • David Newman, Sarvnaz Karimi, and Lawrence Cavedon. 2009. External evaluation of topic models. pages 11–18, December.
    • Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10.
    • G. Stewart. 1998. Matrix Algorithms. Society for Industrial and Applied Mathematics.
    • Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL '03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.
    • Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.
    • Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, pages 141–188.
    • Peter D. Turney. 2006. Similarity of semantic relations. Computational Linguistics, 32(3):379–416, September.
    • Michael D. Vose. 1991. A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering.
    • Li Wan, Leo Zhu, and Rob Fergus. 2012. A hybrid neural network-latent topic model. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS-12), volume 22, pages 1287–1294.
    • Limin Yao, David Mimno, and Andrew McCallum. 2009. Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 937–946, New York, NY, USA. ACM.
    • Ke Zhai and Jordan L. Boyd-Graber. 2013. Online latent Dirichlet allocation with infinite vocabulary. In Proceedings of ICML, volume 28 of JMLR Proceedings, pages 561–569. JMLR.org.