An optimal approach for text feature selection

Computer Speech and Language (2022)

Abstract
Traditionally, feature selection is conducted by first deriving a candidate list of features, then ranking them and selecting the top features based on a predefined threshold. These methods are highly dependent on the choice of threshold and therefore lead to sub-optimal text categorization results. In this paper, we address the selection problem by suggesting a one-step method designed to optimally select the subset of features. The selection is formulated mathematically as an optimization problem with the objective of maximizing classification accuracy while simultaneously deriving and choosing the most discriminative features. Our method, MFX, is applicable to many of the conventional methods and has two distinguishing aspects. First, it treats all documents from the same category as one extended document instead of analyzing individual documents. Second, it chooses the most discriminative terms: those that are frequent and common across all documents of the same category and minimally present in other categories. Moreover, MFX is language-independent. It was tested on the well-known benchmark Reuters RCV1 dataset. To showcase its language independence, MFX was also tested on Arabic datasets extracted from Arabic news sources. The results indicated that MFX always performed similarly to or better than other well-known feature selection methods. MFX with a Support Vector Machine (SVM) classifier was also shown to outperform recent text classification algorithms based on neural networks and word embeddings.
Keywords
Feature selection, Text categorization, Text mining, Data mining, Arabic text mining
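
The two aspects named in the abstract (merging each category into one extended document, and preferring terms that are common within a category but rare elsewhere) can be illustrated with a small sketch. This is not the MFX formulation: the abstract does not give the objective function, so the score below (within-category document coverage multiplied by the category's share of the term's total occurrences) is an assumption chosen only for illustration, as are the function names `category_term_scores` and `top_terms`. The parameter `k` is a hypothetical cutoff used to display results; the paper's point is precisely to avoid such a threshold through optimization, which this sketch does not attempt.

```python
from collections import Counter, defaultdict


def category_term_scores(docs, labels):
    """Score every (category, term) pair with an illustrative heuristic.

    score = coverage * exclusivity, where
      coverage    = fraction of the category's documents containing the term
      exclusivity = share of the term's total occurrences that fall inside
                    the category's merged ("extended") document
    This scoring rule is an assumption, not the published MFX objective.
    """
    merged = defaultdict(Counter)    # category -> term counts of the extended document
    doc_freq = defaultdict(Counter)  # category -> number of documents containing each term
    n_docs = Counter(labels)         # category -> number of documents

    for text, label in zip(docs, labels):
        tokens = text.lower().split()        # whitespace tokenization, language-agnostic
        merged[label].update(tokens)
        doc_freq[label].update(set(tokens))

    total = Counter()                        # term -> occurrences across all categories
    for counts in merged.values():
        total.update(counts)

    return {
        label: {
            term: (doc_freq[label][term] / n_docs[label]) * (count / total[term])
            for term, count in counts.items()
        }
        for label, counts in merged.items()
    }


def top_terms(scores, k):
    """Return the k highest-scoring terms per category (ties broken alphabetically)."""
    return {
        label: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for label, s in scores.items()
    }


if __name__ == "__main__":
    docs = [
        "oil prices rise",
        "oil exports fall",
        "team wins match",
        "team loses match prices",
    ]
    labels = ["econ", "econ", "sport", "sport"]
    print(top_terms(category_term_scores(docs, labels), k=2))
```

In this toy corpus, "oil" appears in every "econ" document and in no "sport" document, so it receives the maximum score for "econ", while "prices" is shared between the two categories and scores low in both.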