Highly discriminative statistical features for email classification

Knowledge and Information Systems(2011)

引用 44|浏览0
暂无评分
摘要
This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first one is a classic classification scenario using a 10-fold cross-validation technique for several corpora, including four ground truth standard corpora: Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus, and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods in the frame of email classification on different benchmarking corpora, and the evidence that especially the technique of biased discriminant analysis offers better discriminative features for the classification, gives stable classification results notwithstanding the amount of features chosen, and robustly retains their discriminative value over time and data setups. These findings are especially useful in a commercial setting, where short profile rules are built based on a limited number of features for filtering emails.
更多
查看译文
关键词
Data mining,Dimensionality reduction,Email classification,Feature extraction,Feature selection
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要