Boosting the Feature Space: Text Classification for Unstructured Data on the Web

ICDM(2006)

Abstract
The search for efficient and effective methods of classifying unstructured text in large document corpora has received much attention in recent years. Traditional document representations such as bag-of-words encode documents as feature vectors, which usually leads to sparse feature spaces of high dimensionality and makes high classification accuracy hard to achieve. This paper addresses the problem of classifying unstructured documents on the Web. We propose a classification approach that combines traditional feature reduction techniques with a collaborative filtering method for augmenting document feature spaces. The method produces feature spaces with an order of magnitude fewer features than a baseline bag-of-words feature selection method. Experiments on both real-world data and a benchmark corpus indicate that the approach improves classification accuracy over traditional methods for both support vector machine and AdaBoost classifiers.
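The abstract does not spell out the feature reduction or the collaborative filtering step, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes document-frequency thresholding as the reduction step and an item-based, cosine-similarity scheme for filling in absent terms; the function names and the `min_df` parameter are hypothetical.

```python
from collections import Counter
from math import sqrt


def bag_of_words(docs):
    """Encode each document as a sparse term-count vector."""
    return [Counter(doc.lower().split()) for doc in docs]


def select_by_document_frequency(vectors, min_df=2):
    """Assumed reduction step: keep terms found in at least min_df documents."""
    df = Counter(term for vec in vectors for term in vec)
    keep = {term for term, n in df.items() if n >= min_df}
    reduced = [{t: c for t, c in vec.items() if t in keep} for vec in vectors]
    return reduced, sorted(keep)


def term_similarity(vectors, vocab):
    """Cosine similarity between terms, using their per-document counts."""
    cols = {t: [vec.get(t, 0) for vec in vectors] for t in vocab}
    norms = {t: sqrt(sum(x * x for x in col)) for t, col in cols.items()}
    sim = {}
    for a in vocab:
        for b in vocab:
            if a == b:
                continue
            denom = norms[a] * norms[b]
            dot = sum(x * y for x, y in zip(cols[a], cols[b]))
            sim[(a, b)] = dot / denom if denom else 0.0
    return sim


def augment(vec, vocab, sim):
    """Assumed augmentation step: give each absent term a soft weight equal
    to its count-weighted average similarity to the terms that are present."""
    total = sum(vec.values())
    out = {t: float(c) for t, c in vec.items()}
    if total == 0:
        return out
    for t in vocab:
        if t in vec:
            continue
        score = sum(sim[(t, u)] * c for u, c in vec.items()) / total
        if score > 0:
            out[t] = score
    return out


docs = [
    "cheap flights paris",
    "cheap hotel paris",
    "cheap flights hotel rome",
    "python code tutorial",
    "python code examples",
]
reduced, vocab = select_by_document_frequency(bag_of_words(docs), min_df=2)
sim = term_similarity(reduced, vocab)
augmented = augment(reduced[0], vocab, sim)
print(sorted(augmented))
```

In this toy corpus the first document never mentions "hotel", yet the term receives a nonzero weight because it co-occurs with "cheap", "flights" and "paris" elsewhere, while unrelated terms such as "python" stay absent. The densified vectors would then be passed to a classifier such as a support vector machine or AdaBoost.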
Keywords
feature reduction technique, unstructured text classification, unstructured data, feature vector, AdaBoost classifier, document feature space augmentation, bag-of-words encodes document, document corpora, information filtering, text classification, feature extraction, collaborative filtering method, support vector machine, internet, high classification accuracy, augmenting document feature space, classification approach, web, classification accuracy, classification, large document corpus, feature space, bag-of-words feature selection method, text analysis, utilizes traditional feature reduction, effective method, bag of words, feature selection