Topic2features: A Novel Framework To Classify Noisy And Sparse Textual Data Using LDA Topic Distributions

PeerJ Computer Science (2021)

Abstract
In supervised machine learning, and in classification tasks specifically, selecting and analyzing the feature vector is one of the most important steps toward better results. Traditional methods, such as comparing the features' cosine similarity and manually exploring the datasets to check which feature vector is suitable, are relatively time consuming. Many classification tasks fail to achieve good results because of poor feature vector selection and data sparseness. In this paper, we propose a novel framework, topic2features (T2F), to deal with short and sparse data by taking the topic distributions of hidden topics gathered from the dataset and converting them into feature vectors used to build a supervised classifier. For this, we leverage the unsupervised topic modelling approach LDA (latent Dirichlet allocation) to retrieve the topic distributions, which are then employed in supervised learning algorithms. We make use of labelled data and the topic distributions of hidden topics generated from that data. We explore how the topic-based representation affects classification performance by applying supervised classification algorithms. Additionally, we carefully evaluate the framework on two types of datasets and compare it with baseline approaches without topic distributions and with other comparable methods. The results show that our framework performs significantly better in terms of classification performance than the baseline approaches (without T2F) and also improves the F1 score relative to the other compared approaches.
Keywords
Classification, Machine learning, Topic analysis, Text analysis, Natural language processing, Sparse Data, Social media
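
The pipeline described in the abstract (fit LDA on the texts, take each document's topic distribution as its feature vector, then train a supervised classifier on the labelled vectors) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the collapsed Gibbs sampler, the nearest-centroid classifier, the toy corpus, the hyperparameters, and all function names (`lda_topic_features`, `fit_centroids`, `predict`) are assumptions chosen to keep the example dependency-free. The abstract does not say which LDA inference method or which classifiers the paper uses.

```python
import random


def lda_topic_features(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling and return one topic
    distribution (a list of num_topics probabilities) per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions: these are the feature vectors.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


def fit_centroids(features, labels):
    """Nearest-centroid classifier (a stand-in for any supervised learner)."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


# Toy corpus (hypothetical): six labelled short texts and two held-out ones.
texts = [
    "match goal team player score",
    "team coach player season win",
    "goal score win match league",
    "software code bug release server",
    "server code deploy software api",
    "api bug release deploy code",
    "player team goal win",
    "code server bug api",
]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]

NUM_TOPICS = 2
docs = [t.lower().split() for t in texts]

# The topic model is unsupervised, so it is fitted on all texts;
# labels are used only when training the classifier.
features = lda_topic_features(docs, NUM_TOPICS)
centroids = fit_centroids(features[:6], train_labels)
predictions = [predict(centroids, x) for x in features[6:]]
print(predictions)
```

Fitting the topic model on labelled and held-out texts together is a simplification that suits this toy corpus. A setup closer to the one the abstract describes would fit LDA on the training data and infer topic distributions for unseen documents, typically with a library implementation of LDA and a stronger classifier.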