Targeted Topic Modeling for Focused Analysis

KDD, 2016.

Keywords: targeted modeling, latent semantic indexing, Center for Tobacco Products, topic model

Abstract:

One of the overarching tasks of document analysis is to find what topics people talk about. One of the main techniques for this purpose is topic modeling, and many models have been proposed. However, the existing models typically perform full analysis on the whole data to find all topics. This is certainly useful, but in practice we found that the user almost always wants to perform deeper and more focused analysis on some specific aspects of the data.

Introduction
  • One of the important text mining tasks is to discover the topics discussed in a collection of text documents.
  • Existing topic models perform full analysis on the entire corpus to discover all topics (paper DOI: http://dx.doi.org/10.1145/2939672.2939743).
  • This is certainly useful, but it is inevitably coarse.
  • For example, given a set of tweets about e-cigarettes, a user may want to gain insight into the topics that have been discussed about children.
  • If a topic model can find topics related to this target, such as regulations and fears about children smoking e-cigarettes, it will be very useful.
  • The proposed targeted analysis problem is defined as follows
Highlights
  • One of the important text mining tasks is to discover the topics discussed in a collection of text documents
  • Topic modeling is one of the main techniques used for this purpose
  • In practice we found that the user almost always wants to perform deeper and more focused analysis on some specific aspects of the data, which we refer to as targets, or targeted aspects in this paper
  • In Equation 6, P(i)@n indicates the precision@n for model i, given the targeted aspect; #C(i)_st is the number of correct words found in topic s_t, given that there are S_T topics found by model i; #C_mt is the maximum number of correct words across all models
  • We studied the novel problem of targeted modeling
  • Instead of finding all topics from a corpus like existing models based on full modeling, the proposed model focuses on finding topics of a targeted aspect to help the user perform deeper or finer-grained analysis
Methods
  • Data and targeted aspects: Five real-world data sets in different domains are used in the experiments, namely, ECigarette, Cigar, Camera, Cell-Phone and Computer.
  • The first two data sets are tweets collected from Twitter in October 2014.
  • E-Cigarette and Cigar are two types of tobacco-related products, which are research areas of the last author, who works in health science.
  • The last three datasets are product reviews of three popular electronic products.
  • More detailed information about the data sets is presented in Table 2
Results
  • Evaluation Measure

    The authors use a normalized form of precision that can evaluate both the correctness of the topical words and the number of detected topics in a unified manner as both are important.
  • #C_mt is the maximum number of correct words across all models.
  • #C(i)_st is the number of correct words found in topic s_t, given that there are S_T topics found by model i.
  • This evaluation measure is fair and reasonable because a model may only find one correct topic with high topical word precision but miss some correct topics.
  • If there are multiple topics mixed in a single topic generated by a model, the authors use the best topic based on the number of relevant words in the top 20 words
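The normalized precision described above can be sketched in Python. This is an illustrative reading of the description, not the paper's exact Equation 6: the function name `precision_at_n`, its arguments, and the per-topic normalization by the best model's correct-word count (#C_mt) are assumptions based on the prose.

```python
def precision_at_n(model_topics, gold_words, n=20, c_max=None):
    """Sketch of a normalized precision@n for targeted topic evaluation.

    model_topics: list of ranked word lists, one per topic the model found.
    gold_words:   set of correct (target-relevant) words.
    c_max:        best correct-word count achieved by any model (#C_mt);
                  pass it in after scoring every model, or leave None to
                  normalize against this model's own best topic.
    """
    # #C_st: correct words among the top-n words of each detected topic.
    counts = [sum(1 for w in topic[:n] if w in gold_words)
              for topic in model_topics]
    if not counts:
        return 0.0
    if c_max is None:
        c_max = max(counts) or 1
    # Normalize each topic's correct-word count by the maximum, then
    # average over the S_T topics the model found, so both word
    # correctness and topic coverage affect the score.
    return sum(c / c_max for c in counts) / len(counts)
```

A model that finds one very clean topic but misses others is thus penalized relative to a model that covers more correct topics, which matches the fairness argument in the text.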
Conclusion
  • Instead of finding all topics from a corpus like existing models based on full modeling, the proposed model focuses on finding topics of a targeted aspect to help the user perform deeper or finer-grained analysis
  • This is motivated by real-life applications in which researchers are often not interested in everything in a corpus but only in some aspects of it, in order to answer their research questions.
  • Experimental results showed that this is the case and that the proposed model markedly outperforms state-of-the-art existing models.
Tables
  • Table1: Definitions of notations. When the targeted aspect is given by a user, a document can be identified as relevant or irrelevant to it. r represents this relevance, r ∈ {0, 1}: r=1 means the document is relevant to the target and r=0 means it is irrelevant. Another related variable is x, which represents whether a document contains at least one keyword s ∈ S. x ∈ {0, 1}: x=1 indicates the document contains the keyword(s) and x=0 indicates it does not. For example, when S is {“children”} and a document m says “ecigarette is a gateway to smoking for children” (i.e., example d1), the keyword indicator xm=1 because document m contains the keyword “children”. In this case (xm=1), m is regarded as relevant (rm=1), because it is unlikely that a short sentence contains the word “children” without talking about children. However, this is a soft constraint that can be relaxed by adjusting a control factor λ (presented in Equation 1), 0 ≤ λ ≤ 1; i.e., λ controls how much we believe that a document containing a keyword is actually relevant. The opposite situation, where no keyword is found in a document m (i.e., xm=0), is a different case because the document can be either relevant or irrelevant. For instance, example d2 above is clearly relevant to the target children while example d3 is not, yet neither contains the keyword “children”. We will discuss how to handle this case (xm=0) in the following sub-sections
  • Table2: Five datasets, targeted aspects, and initial documents (tweets or review sentences)
  • Table3: Precisions of setting one. The last two rows are (a) the average scores over all targeted aspects of all topics and (b) the improvement achieved by TTM over the other models, respectively. The number of topics for the baseline models is set to 15 or 30 (choosing the one that produces higher precision results). Their numbers are larger because, unlike LDA-PD and TTM, they do not directly generate topics for the targeted aspect; they also produce topics for other, non-targeted aspects
  • Table4: Precisions of setting two. The last two rows are (a) the average scores of all targeted aspects of all topics and (b) the improvement achieved by TTM over other models respectively
  • Table5: Topics of aspect children under E-Cig. Errors are italicized and marked in red
  • Table6: Topics of two aspects screen and weight under Camera. Errors are italicized and marked in red
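The keyword indicator x described in the Table 1 caption can be sketched as follows. The function `keyword_indicator` and its whole-word matching are illustrative assumptions, not the authors' implementation; the point is that x=1 documents are treated as relevant with confidence λ, while x=0 documents may still be relevant and must be handled by the model.

```python
import re

def keyword_indicator(doc, keywords):
    """x_m = 1 iff document m contains at least one target keyword s in S.

    Sketch only: lowercase whole-word matching is an assumed tokenization.
    """
    tokens = set(re.findall(r"\w+", doc.lower()))
    return 1 if any(k.lower() in tokens for k in keywords) else 0

# Example d1 from the caption contains the keyword "children", so x = 1
# and the document is regarded as relevant with confidence lambda.
S = {"children"}
x_d1 = keyword_indicator("ecigarette is a gateway to smoking for children", S)
# A document without any keyword gets x = 0; it may still be relevant
# (like example d2), which the model must infer rather than assume.
x_other = keyword_indicator("my kid wants to try vaping", S)
```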
Related work
  • To our knowledge, no existing topic model is able to perform the proposed targeted analysis. Our work is, however, clearly related to the classic topic models such as PLSA [14] and LDA [4] and their variants. These models have been used to discover hidden thematic structures in a collection of documents or corpus. There are numerous existing models (e.g., [25, 24, 34, 23, 8, 11, 27]). They either identify topics only or jointly identify both topics and other types of information. For example, while both LDA and PLSA identify only topics, [30, 22] jointly model both topics and ratings in reviews, [24, 10] model labeled data with class information, and [19, 15] conduct time-series analysis of topics. However, as we indicated in the introduction section, all these models and their variants are full-analysis models: they aim to find all topics in the corpus, and none of them can perform targeted analysis based on only a specific aspect that is of interest to the user. Existing research has also proposed several knowledge-based topic models, which incorporate prior domain knowledge into topic modeling [1, 23, 9, 32] to generate better results. But they too are full-analysis models, and do not help discover topics related to the user's aspect of interest.
Funding
  • Shuai Wang and Sherry Emery’s research was supported by the National Science Foundation (NSF) under award number NSF1524750 and by the National Cancer Institute of the National Institutes of Health (NIH) and the FDA Center for Tobacco Products (CTP) under award number P50CA179546.
  • Bing Liu’s research was supported in part by a grant from the National Science Foundation (NSF) under grant no. IIS1407927 and an NCI grant under grant no.
Reference
  • [1] D. Andrzejewski, X. Zhu, and M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML, pages 25–32. ACM, 2009.
  • [2] C. Archambeau, B. Lakshminarayanan, and G. Bouchard. Latent IBP compound Dirichlet allocation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):321–333, 2015.
  • [3] Y. Bengio, A. C. Courville, and J. S. Bergstra. Unsupervised models of images by spike-and-slab RBMs. In ICML, pages 1145–1152, 2011.
  • [4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • [5] S. Brody and N. Elhadad. An unsupervised aspect-sentiment model for online reviews. In NAACL, pages 804–812. Association for Computational Linguistics, 2010.
  • [6] J. Chang, S. Gerrish, C. Wang, J. L. Boyd-Graber, and D. M. Blei. Reading tea leaves: How humans interpret topic models. In NIPS, pages 288–296, 2009.
  • [7] X. Chen, Y. Qi, B. Bai, Q. Lin, and J. G. Carbonell. Sparse latent semantic analysis. In SDM, pages 474–485. SIAM, 2011.
  • [8] X. Chen, M. Zhou, and L. Carin. The contextual focused topic model. In KDD, pages 96–104. ACM, 2012.
  • [9] Z. Chen and B. Liu. Mining topics in documents: standing on the shoulders of big data. In KDD, pages 1116–1125. ACM, 2014.
  • [10] J. Eisenstein, A. Ahmed, and E. P. Xing. Sparse additive generative models of text. 2011.
  • [11] D. Griffiths and M. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. NIPS, 16:17, 2004.
  • [12] T. L. Griffiths and M. Steyvers. Finding scientific topics. PNAS, 101(suppl 1):5228–5235, 2004.
  • [14] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR, pages 50–57. ACM, 1999.
  • [15] L. Hong, D. Yin, J. Guo, and B. D. Davison. Tracking trends: incorporating term volume into temporal topic models. In KDD, pages 484–492. ACM, 2011.
  • [16] H. Ishwaran and J. S. Rao. Spike and slab variable selection: frequentist and Bayesian strategies. Annals of Statistics, pages 730–773, 2005.
  • [17] Y. Jo and A. H. Oh. Aspect and sentiment unification model for online review analysis. In WSDM, pages 815–824. ACM, 2011.
  • [18] T. Lin, W. Tian, Q. Mei, and H. Cheng. The dual-sparse topic model: mining focused topics and focused terms in short text. In WWW, pages 539–550. ACM, 2014.
  • [19] Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In KDD, pages 198–207. ACM, 2005.
  • [20] K. Min, Z. Zhang, J. Wright, and Y. Ma. Decomposing background topics from keywords by principal component pursuit. In CIKM, pages 269–278. ACM, 2010.
  • [21] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
  • [22] S. Moghaddam and M. Ester. The FLDA model for aspect-based opinion mining: addressing the cold start problem. In WWW, 2013.
  • [23] A. Mukherjee and B. Liu. Aspect extraction through semi-supervised modeling. In ACL, pages 339–348. Association for Computational Linguistics, 2012.
  • [24] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, pages 248–256. Association for Computational Linguistics, 2009.
  • [25] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, pages 487–494. AUAI Press, 2004.
  • [26] I. Titov and R. McDonald. Modeling online reviews with multi-grain topic models. In WWW, pages 111–120. ACM, 2008.
  • [27] H. M. Wallach. Topic modeling: beyond bag-of-words. In ICML, pages 977–984. ACM, 2006.
  • [28] H. M. Wallach, D. M. Mimno, and A. McCallum. Rethinking LDA: why priors matter. In NIPS, pages 1973–1981, 2009.
  • [29] C. Wang and D. M. Blei. Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In NIPS, pages 1982–1989, 2009.
  • [30] H. Wang, Y. Lu, and C. Zhai. Latent aspect rating analysis on review text data: a rating regression approach. In KDD, pages 783–792. ACM, 2010.
  • [31] Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. In SIGIR, pages 685–694. ACM, 2011.
  • [32] S. Wang, Z. Chen, and B. Liu. Mining aspect-specific opinion using a holistic lifelong topic model. In WWW, pages 167–176, 2016.
  • [33] S. Williamson, C. Wang, K. A. Heller, and D. M. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, pages 1151–1158, 2010.
  • [34] W. X. Zhao, J. Jiang, H. Yan, and X. Li. Jointly modeling aspects and opinions with a MaxEnt-LDA hybrid. In EMNLP, pages 56–65. Association for Computational Linguistics, 2010.
  • [35] J. Zhu and E. P. Xing. Sparse topical coding. arXiv preprint arXiv:1202.3778, 2012.