Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora
EMNLP 2009, pp. 248–256
A significant portion of the world's text is tagged by readers on social bookmarking websites. Credit attribution is an inherent problem in these corpora because most pages have multiple tags, but the tags do not always apply with equal specificity across the whole document. Solving the credit attribution problem requires associating each word in a document with the most appropriate tags and vice versa.
- From news sources such as Reuters to modern community web portals like del.icio.us, a significant proportion of the world’s textual data is labeled with multiple human-provided tags.
- These collections reflect the fact that documents are often about more than one thing—for example, a news story about a highway transportation bill might naturally be filed under both transportation and politics, with neither category acting as a clear subset of the other.
- When a user browses to a particular document, a tag-augmented user interface might provide overview visualization cues highlighting which portions of the document are more or less relevant to the tag, helping the user quickly access the information they seek.
- Users who browse for documents with a particular tag might prefer to see summaries that focus on the portion of the document most relevant to the tag, a task we call tag-specific snippet extraction
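Tag-specific snippet extraction can be sketched as follows. This is an illustrative Python sketch (not the authors' code, and the function name `best_snippet` is ours): given the per-word topic assignments that a trained L-LDA model produces, select the sentence whose words are most often assigned to the topic associated with the query tag.

```python
# Illustrative sketch of tag-specific snippet extraction. Assumes a
# trained L-LDA model has already assigned one topic id to each token;
# we then rank sentences by how strongly they "belong" to the tag's topic.

def best_snippet(sentences, assignments, tag_topic):
    """sentences: list of token lists; assignments: parallel list of
    per-token topic ids; tag_topic: topic id associated with the tag.
    Returns the sentence with the highest fraction of tokens assigned
    to tag_topic."""
    def score(tokens, topics):
        return sum(t == tag_topic for t in topics) / max(len(tokens), 1)
    scored = [(score(s, z), s) for s, z in zip(sentences, assignments)]
    return max(scored, key=lambda p: p[0])[1]

# Toy two-sentence document with hand-made topic assignments:
doc = [["the", "bill", "funds", "highways"],
       ["senators", "debated", "the", "vote"]]
z   = [[0, 0, 0, 0],          # mostly topic 0 ("transportation")
       [1, 1, 0, 1]]          # mostly topic 1 ("politics")
print(best_snippet(doc, z, 1))  # sentence dominated by topic 1
```

The key point, matching the bullet above, is that the snippet chosen depends on which tag the user queried, even though the document itself is fixed.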
- This paper has introduced Labeled Latent Dirichlet Allocation, a novel model of multi-labeled corpora that directly addresses the credit assignment problem
- We demonstrate the model’s effectiveness on tasks related to credit attribution within documents, including document visualizations and tag-specific snippet extraction
- Because Labeled Latent Dirichlet Allocation is a graphical model in the Latent Dirichlet Allocation family, it enables a range of natural extensions for future investigation
- The current model does not capture correlations between labels, but such correlations might be introduced by composing Labeled Latent Dirichlet Allocation with state-of-the-art topic models like the Correlated Topic Model (Blei and Lafferty, 2006) or the Pachinko Allocation Model (Li and McCallum, 2006)
- L-LDA was judged superior by a wide margin: of the 149 judgments, L-LDA’s output was selected as preferable in 72 cases, whereas SVM’s was selected in only 21.
- The difference between these scores was highly significant (p < .001) by the sign test.
- The authors applied the method to text classification on the del.icio.us dataset, where the documents are naturally multiply labeled and where the tags are less inherently similar than in the Yahoo subcategories
- One of the main advantages of L-LDA on multiply labeled documents comes from the model’s document-specific topic mixture θ.
- The higher probability for the tag more than makes up for the difference in likelihood for all the words except “CMS” (Content Management System)
- With improved inference for unsupervised Λ, Labeled LDA lends itself naturally to modeling semi-supervised corpora where labels are observed for only some documents
- Table 1: Generative process for Labeled LDA: βk is a vector consisting of the parameters of the multinomial distribution corresponding to the kth topic, α are the parameters of the Dirichlet topic prior, and η are the parameters of the word prior, while Φk is the label prior for topic k. For the meaning of the projection matrix L(d), please refer to Eq. 1
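The role of the projection matrix in this generative process can be sketched in a few lines of Python. This is a hedged illustration under our own variable names (`L_d`, `alpha_d`, etc., are not from the paper): for a document with an observed label set, a 0/1 projection matrix picks out the entries of the global Dirichlet parameter α that correspond to the document's labels, so the topic mixture θ is drawn over only those topics.

```python
import numpy as np

# Sketch of the label projection used in the generative process above
# (variable names are ours). A document labeled with topics {1, 3} gets
# a Dirichlet restricted to those two topics.

K = 4                          # total number of topics/labels in the corpus
alpha = np.full(K, 0.5)        # symmetric Dirichlet topic prior
labels = [1, 3]                # topic ids whose labels appear on this document

# Projection matrix L_d: one row per document label, one column per topic;
# L_d[i, j] = 1 iff the i-th document label is topic j.
L_d = np.zeros((len(labels), K))
for i, j in enumerate(labels):
    L_d[i, j] = 1.0

alpha_d = L_d @ alpha                   # document-specific Dirichlet parameters
theta_d = np.random.dirichlet(alpha_d)  # mixture over the doc's labels only
```

This restriction is what gives L-LDA its credit-attribution behavior: words in the document can only be generated by topics whose labels were actually observed.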
- Table 2: Human judgments of tag-specific snippet quality as extracted by L-LDA and SVM. The center column is the number of document-tag pairs for which a system’s snippet was judged superior. The right column is the number of snippets for which all three annotators were in complete agreement (numerator) in the subset of documents scored by all three annotators (denominator)
- Table 3: Averaged performance across ten runs of multi-label text classification for predicting subsets of the named Yahoo directory categories. Numbers in parentheses are standard deviations across runs. L-LDA outperforms SVMs on 5 subsets with MacroF1, but on no subsets with MicroF1
- Table 4: Mean performance across ten runs of multi-label text classification for predicting 20 tags on del.icio.us data. L-LDA outperforms SVMs significantly on both metrics by a 2-tailed, paired t-test at 95% confidence
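The two metrics in Tables 3 and 4 weight errors differently, which is why a model can win on one and lose on the other. The following sketch (our own helper names, toy counts) shows the distinction: MacroF1 averages per-label F1 scores so every label counts equally, while MicroF1 pools true positives, false positives, and false negatives across all labels before computing a single F1.

```python
# Hedged sketch of MacroF1 vs. MicroF1 for multi-label classification.
# counts holds (true positives, false positives, false negatives) per label.

def f1(tp, fp, fn):
    # Standard F1 from raw counts; define F1 = 0 when there are no TPs.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_micro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per label."""
    macro = sum(f1(*c) for c in counts) / len(counts)  # average of per-label F1
    tp = sum(c[0] for c in counts)                     # pool counts, then one F1
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return macro, f1(tp, fp, fn)

# Toy example: one frequent, well-predicted label and one rare, poorly
# predicted label. MacroF1 is dragged down by the rare label; MicroF1,
# dominated by the frequent label's counts, stays high.
counts = [(90, 10, 10), (1, 5, 5)]
macro, micro = macro_micro_f1(counts)
```

Under this toy data, MicroF1 exceeds MacroF1, mirroring how a classifier that does well on common labels but poorly on rare ones can look strong on MicroF1 while trailing on MacroF1.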
- This project was supported in part by the President of Stanford University through the IRiSS Initiatives Assessment project
- D. M. Blei and J. Lafferty. 2006. Correlated topic models. NIPS, 18:147.
- D. Blei and J. McAuliffe. 2007. Supervised topic models. In NIPS, volume 21.
- D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. JMLR.
- T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. PNAS, 101:5228–35.
- P. Heymann, G. Koutrika, and H. Garcia-Molina. 2008. Can social bookmarking improve web search? In WSDM.
- S. Ji, L. Tang, S. Yu, and J. Ye. 2008. Extracting shared subspace for multi-label classification. In KDD, pages 381–389, New York, NY, USA. ACM.
- H. Kazawa, T. Izumitani, H. Taira, and E. Maeda. 2004. Maximal margin labeling for multi-topic text categorization. In NIPS.
- S. Lacoste-Julien, F. Sha, and M. I. Jordan. 2008. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, volume 22.
- D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. 2004. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397.
- W. Li and A. McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In International Conference on Machine Learning, pages 577–584.
- A. McCallum and K. Nigam. 1998. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 7.
- Q. Mei, X. Shen, and C. Zhai. 2007. Automatic labeling of multinomial topic models. In KDD.
- D. Ramage, P. Heymann, C. D. Manning, and H. Garcia-Molina. 2009. Clustering the tagged web. In WSDM.
- N. Ueda and K. Saito. 2003. Parametric mixture models for multi-label text. In NIPS.