TopicBank: Collection of coherent topics using multiple model training with their further use for topic model validation

Data and Knowledge Engineering (2021)

Abstract
Probabilistic topic modeling of a text collection is a tool for unsupervised learning of the inherent thematic structure of the collection. Given only the text of documents as input, the topic model aims to reveal latent topics as probability distributions over words. The shortcomings of topic models are that they are unstable, in the sense that topics may depend on the random initialization, and incomplete, in the sense that each new run of the model on the same collection may discover new topics. This means that data exploration using topic modeling usually requires many experiments: looking over many topic models and tuning their parameters in search of a model that describes the data best. To deal with the instability and incompleteness of topic models, we propose to gradually accumulate interpretable topics in a "topic bank" using multiple model training. To add topics to the bank, we learn a child level in a hierarchical topic model, then analyze the coherence of child subtopics and their relationships with parent bank topics in order to exclude irrelevant and duplicate subtopics rather than adding them to the bank. We then introduce a new approach to topic model evaluation: comparing the topics found by a model with those collected beforehand in the bank. Our experiments with several datasets and topic models show that the proposed method does help in finding a model with more interpretable topics.
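The deduplication step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it assumes topics are represented as word-probability dictionaries and uses Jaccard similarity over each topic's top words as a stand-in for whatever relatedness measure the paper employs; the cutoff `threshold` and top-word count `k` are hypothetical parameters.

```python
# Illustrative sketch of a "topic bank": a candidate topic is added only if
# it does not duplicate an existing bank topic. Topics are dicts mapping
# words to probabilities; duplication is judged (as an assumption, not the
# paper's exact criterion) by Jaccard similarity of the top-k word sets.

def top_words(topic, k=3):
    """Return the set of the k most probable words of a topic."""
    return {w for w, _ in sorted(topic.items(), key=lambda x: -x[1])[:k]}

def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def add_to_bank(bank, candidate, k=3, threshold=0.5):
    """Append candidate to bank unless it duplicates an existing bank topic."""
    cand = top_words(candidate, k)
    for topic in bank:
        if jaccard(top_words(topic, k), cand) >= threshold:
            return False  # near-duplicate of a bank topic: excluded
    bank.append(candidate)
    return True

bank = [{"dog": 0.4, "cat": 0.3, "pet": 0.2, "food": 0.1}]
duplicate = {"cat": 0.5, "dog": 0.3, "pet": 0.1, "toy": 0.1}
distinct = {"stock": 0.5, "market": 0.3, "price": 0.2}
print(add_to_bank(bank, duplicate))  # False: same top words as the pet topic
print(add_to_bank(bank, distinct))   # True: unrelated topic, added
print(len(bank))                     # 2
```

In the paper itself the decision additionally involves topic coherence and parent-child relationships in a hierarchical model; the sketch above covers only the duplicate-exclusion idea.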
Keywords
Topic modeling, Multiple model training, Topic coherence, Stability, Regularization