Do not crawl in the DUST: different URLs with similar text

Proceedings of the 15th International Conference on World Wide Web (2006)

Cited by 18
Abstract
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, translates URLs to some canonical form, and dynamically generates the same page in response to various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL into others that are likely to have similar content. DustBuster detects DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a small number of actual web pages. Search engines can benefit from this information to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
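To make the rule-based view concrete, below is a minimal sketch of what applying DUST rules could look like, assuming a rule takes a substring-substitution form (replace substring alpha with beta). The rule list and example URL are hypothetical and purely illustrative, not the paper's implementation; discovering such rules from logs is the harder problem DustBuster addresses.

from typing import List, Tuple

# A DUST rule is modeled here as a pair (alpha, beta): replacing
# substring alpha with beta yields a URL that is likely to serve
# similar content. (Modeling assumption for illustration.)
Rule = Tuple[str, str]

# Hypothetical rules of the kind DustBuster is meant to discover.
EXAMPLE_RULES: List[Rule] = [
    ("/index.html", "/"),  # trailing index page often aliases the directory
    ("://www.", "://"),    # "www" host prefix frequently serves the same site
]

def apply_rules(url: str, rules: List[Rule]) -> List[str]:
    """Apply each rule once and return the candidate duplicate URLs."""
    candidates = []
    for alpha, beta in rules:
        if alpha in url:
            candidates.append(url.replace(alpha, beta, 1))
    return candidates

if __name__ == "__main__":
    for dup in apply_rules("http://www.example.com/index.html", EXAMPLE_RULES):
        print(dup)

This prints the two candidate duplicates http://example.com/index.html and http://www.example.com/. A crawler would fetch a small sample of pages to verify each rule before canonicalizing URLs at scale, matching the sampling-based verification the abstract describes.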
Keywords
mining, duplicates, duplicate URLs, similar text, different URLs, DustBuster, rules, page content, web server logs, web sites, similarity, indexing overhead, URL normalization, crawling, web pages, search engines