Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora

EMNLP, pp. 248–256 (2009)


Abstract

A significant portion of the world's text is tagged by readers on social bookmarking websites. Credit attribution is an inherent problem in these corpora because most pages have multiple tags, but the tags do not always apply with equal specificity across the whole document. Solving the credit attribution problem requires associating each word in a document with the most appropriate tags and vice versa. This paper introduces Labeled LDA, a topic model that constrains Latent Dirichlet Allocation by defining a one-to-one correspondence between LDA's latent topics and user tags. This allows Labeled LDA to directly learn word-tag correspondences. We demonstrate Labeled LDA's improved expressiveness over traditional LDA with visualizations of a corpus of tagged web pages from del.icio.us. Labeled LDA outperforms SVMs by more than 3 to 1 when extracting tag-specific document snippets. As a multi-label text classifier, our model is competitive with a discriminative baseline on a variety of datasets.

Introduction
  • From news sources such as Reuters to modern community web portals like del.icio.us, a significant proportion of the world’s textual data is labeled with multiple human-provided tags.
  • These collections reflect the fact that documents are often about more than one thing—for example, a news story about a highway transportation bill might naturally be filed under both transportation and politics, with neither category acting as a clear subset of the other.
  • When a user browses to a particular document, a tag-augmented user interface might provide overview visualization cues highlighting which portions of the document are more or less relevant to the tag, helping the user quickly access the information they seek.
Highlights
  • From news sources such as Reuters to modern community web portals like del.icio.us, a significant proportion of the world’s textual data is labeled with multiple human-provided tags
  • Users who browse for documents with a particular tag might prefer to see summaries that focus on the portion of the document most relevant to the tag, a task we call tag-specific snippet extraction (a sketch of this idea follows this list)
  • This paper has introduced Labeled Latent Dirichlet Allocation, a novel model of multi-labeled corpora that directly addresses the credit assignment problem
  • We demonstrate the model's effectiveness on tasks related to credit attribution within documents, including document visualizations and tag-specific snippet extraction
  • Because Labeled Latent Dirichlet Allocation is a graphical model in the Latent Dirichlet Allocation family, it enables a range of natural extensions for future investigation
  • The current model does not capture correlations between labels, but such correlations might be introduced by composing Labeled Latent Dirichlet Allocation with newer state-of-the-art topic models like the Correlated Topic Model (Blei and Lafferty, 2006) or the Pachinko Allocation Model (Li and McCallum, 2006)
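
As a concrete illustration of the tag-specific snippet extraction task mentioned above: once a model like L-LDA has learned a per-tag distribution over words, the sentence that a tag's topic explains best can serve as that tag's snippet. The sketch below is a minimal illustration, not the paper's implementation; the `beta` dictionary, its probabilities, and the `floor` smoothing constant are made-up stand-ins for learned model parameters.

```python
import math

def extract_snippet(sentences, tag, beta, floor=1e-9):
    """Return the sentence whose words the tag's topic best explains."""
    def score(sentence):
        words = sentence.lower().split()
        if not words:
            return float("-inf")
        # Mean (not summed) log-likelihood keeps long sentences from being
        # favored or penalized simply for containing more words.
        return sum(math.log(beta[tag].get(w, floor)) for w in words) / len(words)
    return max(sentences, key=score)

# Toy usage with invented per-tag word probabilities:
beta = {"transportation": {"highway": 0.05, "bill": 0.01, "traffic": 0.04},
        "politics": {"bill": 0.05, "senate": 0.06, "vote": 0.05}}
doc = ["The highway carries heavy traffic daily.",
       "The senate will vote on the bill."]
print(extract_snippet(doc, "transportation", beta))  # first sentence
print(extract_snippet(doc, "politics", beta))        # second sentence
```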
Results
  • L-LDA was judged superior by a wide margin: of the 149 judgments, L-LDA’s output was selected as preferable in 72 cases, whereas SVM’s was selected in only 21.
  • The difference between these scores was highly significant (p < .001) by the sign test.
  • The authors applied the method to text classification on the del.icio.us dataset, where the documents are naturally multiply labeled and where the tags are less inherently similar than in the Yahoo subcategories (the evaluation computations are sketched after this list)
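
Two quick computations behind these Results bullets, sketched from standard definitions rather than the authors' evaluation code: a two-sided sign test on the 72-vs-21 preference counts (assuming the remaining judgments among the 149 are treated as ties, which the sign test excludes), and the MicroF1/MacroF1 metrics used in the classification experiments of Tables 3 and 4.

```python
import math
from collections import Counter

# Two-sided sign test: 72 wins for L-LDA vs. 21 for SVM, ties dropped.
n, wins = 72 + 21, 72
p = 2 * sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n
print(f"sign test p = {p:.1e}")  # far below .001, consistent with the paper

def micro_macro_f1(gold, pred, labels):
    """gold, pred: per-document label sets. MacroF1 averages per-label F1
    (each label counts equally); MicroF1 pools counts over all labels, so
    frequent labels dominate."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p_ in zip(gold, pred):
        for lab in labels:
            if lab in p_ and lab in g:
                tp[lab] += 1
            elif lab in p_:
                fp[lab] += 1
            elif lab in g:
                fn[lab] += 1
    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

# Toy usage with invented labels:
gold = [{"politics", "transportation"}, {"tech"}]
pred = [{"politics"}, {"tech", "politics"}]
print(micro_macro_f1(gold, pred, ["politics", "transportation", "tech"]))
```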
Conclusion
  • One of the main advantages of L-LDA on multiply labeled documents comes from the model’s document-specific topic mixture θ.
  • The higher probability for the tag more than makes up the difference in the likelihood for all the words except “CMS” (Content Management System).
  • This paper has introduced Labeled LDA, a novel model of multi-labeled corpora that directly addresses the credit assignment problem (the label-restricted sampling step behind this is sketched after this list).
  • With improved inference for unsupervised Λ, Labeled LDA lends itself naturally to modeling semi-supervised corpora where labels are observed for only some documents.
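
The credit-attribution mechanism these conclusions rest on is easy to sketch: L-LDA's collapsed Gibbs sampler is exactly standard LDA's, except that each document's topic assignments are drawn only from that document's observed label set. The toy sampler below illustrates that single restriction; the hyperparameters, data layout, and function name are illustrative assumptions, not the authors' code.

```python
import random
from collections import defaultdict

def gibbs_llda(docs, labels, vocab_size, iters=50, alpha=0.1, eta=0.01):
    """docs: list of word-id lists; labels: list of label-id sets (one per doc)."""
    n_dk = defaultdict(int)  # document-topic counts
    n_kw = defaultdict(int)  # topic-word counts
    n_k = defaultdict(int)   # per-topic totals
    z = []                   # current topic assignment for every token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = random.choice(sorted(labels[d]))  # initialize within the label set
            z[d].append(k)
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            cand = sorted(labels[d])  # this document's allowed topics
            for i, w in enumerate(doc):
                k = z[d][i]  # remove the token's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # The only change from plain LDA: resample over the
                # document's own labels, so credit goes to one of its tags.
                weights = [(n_dk[d, t] + alpha)
                           * (n_kw[t, w] + eta) / (n_k[t] + vocab_size * eta)
                           for t in cand]
                k = random.choices(cand, weights=weights)[0]
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_kw  # assignments and topic-word counts

# Toy usage: two documents over a 4-word vocabulary with 3 labels total.
docs = [[0, 1, 2, 1], [2, 3, 3]]
labels = [{0, 1}, {1, 2}]
z, n_kw = gibbs_llda(docs, labels, vocab_size=4)
```

Under the semi-supervised extension mentioned above, a document with unobserved labels would simply use the full topic set as its `labels[d]`.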
Tables
  • Table 1: Generative process for Labeled LDA: βk is a vector consisting of the parameters of the multinomial distribution corresponding to the kth topic, α are the parameters of the Dirichlet topic prior, and η are the parameters of the word prior, while Φk is the label prior for topic k. For the meaning of the projection matrix L(d), refer to Eq. 1 (a minimal sketch of this projection follows this list)
  • Table 2: Human judgments of tag-specific snippet quality as extracted by L-LDA and SVM. The center column is the number of document-tag pairs for which a system's snippet was judged superior. The right column is the number of snippets for which all three annotators were in complete agreement (numerator) within the subset of documents scored by all three annotators (denominator)
  • Table 3: Averaged performance across ten runs of multi-label text classification for predicting subsets of the named Yahoo directory categories. Numbers in parentheses are standard deviations across runs. L-LDA outperforms SVMs on 5 subsets with MacroF1, but on no subsets with MicroF1
  • Table 4: Mean performance across ten runs of multi-label text classification for predicting 20 tags on del.icio.us data. L-LDA outperforms SVMs significantly on both metrics by a 2-tailed, paired t-test at 95% confidence
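
To make Table 1's projection matrix concrete: for a document d with label set Λ(d), L(d) is a |Λ(d)| × K indicator matrix that selects the entries of the shared Dirichlet prior α belonging to d's labels, so the document's topic mixture is drawn as θ(d) ∼ Dir(L(d)α). A minimal numpy sketch under that reading (names are illustrative):

```python
import numpy as np

def project_prior(doc_labels, alpha):
    """Build L(d) for a document and return the projected prior L(d) @ alpha.

    doc_labels: sorted list of this document's label indices (Lambda(d)).
    alpha: length-K vector, the corpus-wide Dirichlet topic prior.
    """
    K = len(alpha)
    L = np.zeros((len(doc_labels), K))
    for i, j in enumerate(doc_labels):
        L[i, j] = 1.0              # L(d)_ij = 1 iff lambda_i^(d) = j
    return L, L @ alpha            # projected prior over the label set

alpha = np.full(4, 0.1)            # K = 4 topics/tags, symmetric prior
L, alpha_d = project_prior([0, 2], alpha)
theta_d = np.random.dirichlet(alpha_d)  # theta(d) ~ Dir(L(d) alpha)
print(theta_d)                     # mixture over this doc's two labels only
```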
Funding
  • This project was supported in part by the President of Stanford University through the IRiSS Initiatives Assessment project.
References
  • D. M. Blei and J. Lafferty. 2006. Correlated topic models. In NIPS, 18:147.
  • D. Blei and J. McAuliffe. 2007. Supervised topic models. In NIPS, volume 21.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.
  • T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. PNAS, 101:5228–35.
  • P. Heymann, G. Koutrika, and H. Garcia-Molina. 2008. Can social bookmarking improve web search? In WSDM.
  • S. Ji, L. Tang, S. Yu, and J. Ye. 2008. Extracting shared subspace for multi-label classification. In KDD, pages 381–389, New York, NY, USA. ACM.
  • H. Kazawa, T. Izumitani, H. Taira, and E. Maeda. 2004. Maximal margin labeling for multi-topic text categorization. In NIPS.
  • S. Lacoste-Julien, F. Sha, and M. I. Jordan. 2008. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, volume 22.
  • D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. 2004. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397.
  • W. Li and A. McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, pages 577–584.
  • A. McCallum and K. Nigam. 1998. A comparison of event models for Naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 7.
  • Q. Mei, X. Shen, and C. Zhai. 2007. Automatic labeling of multinomial topic models. In KDD.
  • D. Ramage, P. Heymann, C. D. Manning, and H. Garcia-Molina. 2009. Clustering the tagged web. In WSDM.
  • N. Ueda and K. Saito. 2003. Parametric mixture models for multi-labeled text. In NIPS.