Crawling and searching the hidden web (2006)

Abstract
An ever-increasing amount of valuable information on the Web today is hidden behind search interfaces. This information is collectively called the Hidden Web. In this dissertation, we study how to effectively collect data from the Hidden Web and enable users to search within the collected data. More specifically, we address some of the main challenges involved in creating a search engine for the Hidden Web:

Crawling the Hidden Web. We study how to build an effective Hidden-Web crawler that can facilitate the collection of information from the Hidden Web. Since there are no links to Hidden-Web pages, our crawler needs to automatically come up with queries to issue to the Hidden-Web sites. We propose three query generation policies: a policy that picks queries at random from a list of keywords, a policy that picks queries based on their frequency in a generic text collection, and a policy that adaptively picks a good query based on the content of the pages downloaded so far from the Hidden-Web site. We compare the effectiveness of these policies by crawling a number of real Hidden-Web sites.

Updating the Hidden-Web pages. The information on the Web is constantly evolving. Once our crawler has downloaded information from the Hidden Web, it needs to periodically refresh its local copy so that users can search for up-to-date information. We study the evolution of searchable Web sites using real data collected from the Web over a period of one year, and we propose an efficient sampling-based policy for updating the pages.

Indexing and searching the Hidden Web. Once we have downloaded the Hidden-Web pages, we can enable users to search them for useful information. Search engines typically do this by maintaining large-scale inverted indexes, which are replicated dozens of times for scalability and then pruned to reduce the cost of operation.
We show that the current approaches employed by search engines may significantly degrade the quality of results. To alleviate this problem, we propose modifications to current pruning techniques that avoid any degradation in quality while retaining the benefit of lower operating cost.

Fighting Web spam. In the last few years, many Web sites have observed an ever-increasing portion of their traffic coming from search engine referrals. Given the large fraction of Web traffic originating from searches and its high potential monetary value, some Web site operators try to influence the positioning of their pages within search results by crafting spam Web pages. In the case of the Hidden Web, malicious operators may try to pollute our index for their own benefit by injecting spam content into their Hidden-Web databases so that our crawler downloads it. In this dissertation, we study the prevalence of spam on the Web and present a number of techniques to detect it. We also show how machine learning can combine these techniques into a more effective spam detection mechanism.

The techniques proposed in this dissertation have been incorporated into a prototype search engine that currently indexes a few million pages from the Hidden Web.
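The adaptive query-generation policy summarized in the abstract can be illustrated with a minimal sketch. This is not the dissertation's actual algorithm, only a simplified greedy illustration under one assumption: that a term's document frequency in the pages downloaded so far is a reasonable proxy for its frequency in the hidden database, so issuing the most frequent not-yet-issued term is likely to retrieve many new pages.

```python
from collections import Counter

def pick_next_query(downloaded_docs, issued_queries):
    """Greedy adaptive policy (simplified illustration, not the
    dissertation's exact method): choose the term that appears in
    the most downloaded documents and has not been issued yet."""
    doc_freq = Counter()
    for doc in downloaded_docs:
        # Count each term once per document (document frequency).
        for term in set(doc.lower().split()):
            doc_freq[term] += 1
    for term, _ in doc_freq.most_common():
        if term not in issued_queries:
            return term
    return None  # nothing left to ask

sample = ["crawling the hidden web",
          "adaptive crawling policies",
          "crawling cost model"]
print(pick_next_query(sample, set()))  # "crawling" (appears in all 3 docs)
```

In a real crawler this loop would alternate with downloading: issue the chosen query to the site's search form, add the newly retrieved pages to `downloaded_docs`, record the query in `issued_queries`, and repeat until the marginal return drops below some cost threshold.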
Keywords
searchable Web site, malicious Web site operator, Web traffic, hidden web, Web spam, Web site operator, search engine, fighting Web spam, Hidden Web site, Hidden Web page, Hidden Web