Weblog classification for fast splog filtering: a URL language model segmentation approach

HLT-NAACL (2006)

Abstract
This paper shows that, in statistical weblog classification for splog filtering based on n-grams of URL tokens, segmenting URLs beyond the standard punctuation boundaries is helpful. Many splog URLs contain phrases whose words are glued together to evade filtering techniques based on punctuation segmentation and unigrams. A technique that segments long tokens into the words forming the phrase is proposed and evaluated. The resulting tokens are used as features for a weblog classifier whose accuracy is similar to that of humans (78% vs. 76%) and which reaches 93.3% precision in identifying splogs at a recall of 50.9%.
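The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of one common way to split glued tokens: a unigram language model combined with dynamic programming over split points. The word counts in `TOY_COUNTS`, the unknown-word penalties, and the function names `make_logprob`, `segment_token`, and `url_tokens` are all illustrative assumptions, not taken from the paper.

```python
import math
import re

# Hypothetical toy unigram counts (an assumption for illustration only).
# A real system would estimate these from a large text corpus.
TOY_COUNTS = {
    "cheap": 50, "car": 40, "insurance": 30,
    "blog": 60, "online": 45, "casino": 25,
}


def make_logprob(counts, unknown_base=-10.0, unknown_per_char=-5.0):
    """Build a unigram log-probability function with a crude unknown-word penalty."""
    total = sum(counts.values())

    def logprob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Assumed penalty: a fixed cost plus a per-character cost, so that
        # unknown strings stay in one piece rather than being shattered.
        return unknown_base + unknown_per_char * len(word)

    return logprob


def segment_token(token, logprob, max_word_len=20):
    """Return the highest-scoring split of `token` under the unigram model."""
    n = len(token)
    best = [0.0] + [-math.inf] * n   # best[i]: best score for token[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start] + logprob(token[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the token to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(token[start:end])
        end = start
    return words[::-1]


def url_tokens(url, logprob):
    """Split a URL on punctuation, then segment each resulting token further."""
    tokens = []
    for tok in re.split(r"[^a-z0-9]+", url.lower()):
        if tok:
            tokens.extend(segment_token(tok, logprob))
    return tokens


if __name__ == "__main__":
    lp = make_logprob(TOY_COUNTS)
    print(url_tokens("http://cheapcarinsurance.blogspot.com/", lp))
```

Under these toy counts, a token such as `cheapcarinsurance` is split into its component words, while unknown strings are left whole. The resulting tokens could then be fed to any n-gram based classifier as features; the choice of classifier is outside the scope of this sketch.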
Keywords
statistical weblog classification,url language model segmentation,splog urls,weblog classification,fast splog,segments long token,weblog classifier,punctuation segmentation,standard punctuation,language model