Jackknifing Documents and Additive Smoothing for Naive Bayes with Scarce Data

IEEE International Conference on Data Mining (2015)

Abstract
Naïve Bayes (NB) classifiers are well-suited to several applications owing to their easy interpretability and maintainability. However, text classification is often hampered by the lack of adequate training data. This motivates the question: how can we train NB more effectively when training data is very scarce? In this paper, we introduce an established subsampling technique from statistics -- the jackknife -- into machine learning. Our approach jackknifes documents themselves to create new "pseudo-documents." The underlying idea is that although these pseudo-documents do not have semantic meaning, they are equally representative of the underlying distribution of terms. Therefore, they could be used to train any classifier that learns this underlying distribution, namely, any parametric classifier such as NB (but not, for example, non-parametric classifiers such as SVM and k-NN). Furthermore, the marginal value of this additional training data should be highest precisely when the original data is inadequate.

We then show that our jackknife technique is related to the question of additively smoothing NB via an appropriately defined notion of "adjointness." This relation is surprising since it connects a statistical technique for handling scarce data to a question about the NB model. Accordingly, we are able to shed light on optimal values of the smoothing parameter for NB in the very scarce data regime.

We validate our approach on a wide array of standard benchmarks -- both binary and multi-class -- for two event models of multinomial NB. We show that the jackknife technique can dramatically improve the accuracy for both event models of NB in the regime of very scarce training data. In particular, our experiments show that the jackknife can make NB more accurate than SVM for binary problems in the very scarce training data regime. We also provide a comprehensive characterization of the accuracy of these important classifiers (for both binary and multiclass) in the very scarce data regime for benchmark text datasets, without feature selection or class-imbalance correction.
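The abstract does not spell out the exact resampling scheme, so the following is only a minimal sketch of the two ideas it names: generating jackknife pseudo-documents from a single document (assumed here to be the classic delete-one scheme applied to token occurrences) and estimating additively smoothed term probabilities for the multinomial NB event model. The function names, the delete-one interpretation, and the usage example are illustrative assumptions, not the authors' reference implementation.

```python
import math
from collections import Counter


def jackknife_pseudo_documents(tokens):
    """Assumed delete-one jackknife on a document: for a document of n tokens,
    yield n pseudo-documents, each omitting one token occurrence. The
    pseudo-documents carry no semantic meaning but preserve the document's
    term distribution, which is all a parametric model like NB learns."""
    for i in range(len(tokens)):
        yield tokens[:i] + tokens[i + 1:]


def smoothed_log_probs(class_docs, vocab, alpha=1.0):
    """Additively (Lidstone) smoothed per-class term log-probabilities for the
    multinomial event model: P(w|c) = (count(w,c) + alpha) / (N_c + alpha*|V|)."""
    counts = Counter()
    for doc in class_docs:
        counts.update(doc)
    denom = sum(counts.values()) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / denom) for w in vocab}


# Hypothetical usage: augment a scarce training class with pseudo-documents
# before estimating the smoothed multinomial parameters.
if __name__ == "__main__":
    doc = ["naive", "bayes", "text", "bayes"]
    pseudo = list(jackknife_pseudo_documents(doc))   # 4 pseudo-docs of length 3
    vocab = {"naive", "bayes", "text", "svm"}
    params = smoothed_log_probs([doc] + pseudo, vocab, alpha=1.0)
    print(pseudo)
    print(params)
```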
Keywords
Jackknife Subsampling, Scarce Data, Naive Bayes, Multinomial Event Models, Comparison between Naive Bayes and SVM