The classification power of Web features, Version 1.0

Miklós Erdélyi, András A. Benczúr, Bálint Daróczy, András Garzó, Tamás Kiss, Dávid Siklósi

semanticscholar(2013)

Abstract
In this paper we give a comprehensive overview of features devised for Web spam detection and investigate how much various feature classes, some requiring very high computational effort, add to the classification accuracy.

• We collect and handle a large number of features based on recent advances in Web spam filtering, including temporal ones; in particular, we analyze the strength and sensitivity of linkage change.
• We propose new temporal link similarity based features and show how to compute them efficiently on large graphs.
• We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.
• We conclude that, with appropriate learning techniques, a simple and computationally inexpensive feature subset outperforms all results published so far on our data set and can be improved only slightly by computationally expensive features.
• We test our method on three major publicly available data sets: the Web Spam Challenge 2008 data set WEBSPAM-UK2007, the ECML/PKDD Discovery Challenge data set DC2010 and the Waterloo Spam Rankings for ClueWeb09.

∗ This work was supported in part by the EC FET Open project “New tools and algorithms for directed network analysis” (NADINE No 288956), by the EU FP7 Project LAWA (Longitudinal Analytics of Web Archive Data), by OTKA NK 105645, and by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TAMOP-4.2.2.C-11/1/KONV-2012-0013). The research was carried out as part of the EITKIC 12-1-2012-0001 project, which is supported by the Hungarian Government, managed by the National Development Agency, financed by the Research and Technology Innovation Fund, and was performed in cooperation with the EIT ICT Labs Budapest Associate Partner Group (www.ictlabs.elte.hu). This paper is a comprehensive comparison of the best performing classification techniques based on [9, 37, 36, 38] and new experiments.