Improving web spam classification using rank-time features

AIRWeb (2007)

Abstract
In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them a higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification, and we note that this problem occurs generally in learning problems and can be hard to detect. In particular, we find that ensuring no overlapping domains between test and training sets is necessary to accurately test a web spam classifier. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
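
The domain-separation requirement described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the function names `host_of` and `domain_separated_split`, the `test_fraction` and `seed` parameters, and the example URLs are all hypothetical, and the URL hostname is used as a simple stand-in for the domain (a real pipeline would likely group by registered domain instead).

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL (assumed stand-in for its domain)."""
    return (urlsplit(url).hostname or "").lower()


def domain_separated_split(urls, test_fraction=0.25, seed=0):
    """Split URLs so that no hostname appears in both the train and test sets."""
    # Group every URL under its hostname.
    by_host = defaultdict(list)
    for url in urls:
        by_host[host_of(url)].append(url)

    # Shuffle hostnames (not pages) so that whole hosts are assigned to one side.
    hosts = sorted(by_host)
    random.Random(seed).shuffle(hosts)

    # Reserve at least one host for the test set.
    n_test = max(1, round(len(hosts) * test_fraction))
    test_hosts = set(hosts[:n_test])

    train = [u for h in hosts if h not in test_hosts for u in by_host[h]]
    test = [u for h in hosts if h in test_hosts for u in by_host[h]]
    return train, test


# Hypothetical example data: four hosts, several pages each.
urls = [
    "http://a.example/page1", "http://a.example/page2",
    "http://b.example/index", "http://b.example/about",
    "http://c.example/x", "http://d.example/y", "http://d.example/z",
]
train, test = domain_separated_split(urls)
```

Splitting at the page level instead would let pages from the same host land on both sides, which is the kind of overlap the abstract reports as inflating measured precision.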
Keywords
use technique, particular query-dependent rank-time feature, web spam detection, improving web spam classification, query-dependent feature, accurate spam classification, rank-time feature, web spam classifier, dataset construction, classification performance, web spam, self similarity, spam, data mining, search engine, topology, web pages