Mining the web to improve search engine performance (2008)

Abstract
Acting as a repository of a huge amount of information, the World Wide Web is becoming increasingly important to most people's daily lives. The mass of information on the Web becomes more useful when it can be discovered and retrieved by Web users. Web search engines were designed to meet this need of locating relevant information on the Web. The main technology used in Web search engines is data mining. Unlike mining traditional data sources such as books or newspapers, Web data mining involves much more challenging retrieval tasks because the Web is a more chaotic and unpredictable environment. Our work focuses on information-retrieval-based technologies for mining data on the Web. The goal is to enhance users' search experience by improving both the quality and the efficiency of a search engine.

The basic structure of a search engine has been well established in the past decade. Overall, it can be split into three components: crawling, indexing, and query processing. Crawling is an iterative process that starts with a set of initial seed pages and finishes after a reasonable portion of the Web's pages has been downloaded. By parsing the crawled data, an inverted index is created as the data structure that stores, for each query term, the IDs of all documents in which it occurs. When users submit queries, the results are retrieved by going through the inverted index and are ranked according to some measure of relevance. This process is called query processing.

Most ordinary queries return hundreds of thousands of results. Given that a human user will only examine the first few results, the ability to return the most relevant pages at the top is very important to a search engine's success. In this thesis, we first discuss link-based ranking algorithms, the key techniques for identifying relationships between Web pages based on the Web graph (hyperlink) structure. In particular, we address the problem of approximating the PageRank values of individual pages in scenarios where the global computation is not available.

Given that the quality of search results is the key factor in a search engine's success, we then study Web spam detection techniques based on machine learning. In particular, we first develop a baseline classifier that separates spam Web sites from non-spam ones using machine learning techniques. We then improve the results of this baseline classifier by adding a second-level heuristic, or secondary classifier, that uses the baseline classification results of neighboring sites to flip the labels of certain sites. Running on a large data set from a real Web domain, our results show promising improvements over previous spam detection methods.

We also conduct a geographic analysis of a real-world query log in order to better understand the geographic intent behind users' searches. In particular, our objective is to improve location-based search engines by analyzing click-through data extracted from a general query log. We present a detailed study of geographic search queries, a new taxonomy for such queries, and experiments that relate such queries to the sites that are visited and the users that issue them.

Large Web search engines need to process thousands of queries per second on collections of billions of Web pages. As a result, query processing is a major performance bottleneck and cost factor in current search engines.
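The abstract describes the inverted index and query processing only in prose. Before turning to caching, here is a minimal, hypothetical sketch (not taken from the thesis) of the index construction and conjunctive lookup described above, the per-query work whose cost motivates result caching. The function names, the toy document collection, and the whitespace tokenization are illustrative assumptions; ranking, index compression, and distribution across machines are omitted.

from collections import defaultdict

def build_inverted_index(documents):
    # Map each term to the sorted list of IDs of the documents containing it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def process_query(index, query):
    # Return the IDs of documents containing every query term (ranking omitted).
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical toy collection; a real engine crawls and indexes billions of pages.
docs = {
    1: "web search engines mine web data",
    2: "mining the web graph",
    3: "query processing and result caching",
}
index = build_inverted_index(docs)
print(process_query(index, "web mining"))   # prints [2]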
We study caching as one of the methods employed to increase query throughput. In particular, we study result caching, where the results of previously issued queries are stored so that they can be returned without executing the same queries over and over again. We first model result caching as a weighted caching problem: unlike previous work, which aims to maximize the number of queries served directly from the cache, we account for the fact that different queries are retrieved at different costs. We describe a new set of feature-based cache eviction policies that achieve significant improvements in hit ratio over existing methods. In particular, these policies significantly narrow the gap between the best known algorithm and the upper bound given by the clairvoyant algorithm.
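The abstract does not specify which features the eviction policies use, so the sketch below shows only the general shape of a cost-aware result cache. The benefit score (hit count times estimated processing cost) is a simple stand-in assumption for the thesis's feature-based policies, and the class name and interface are hypothetical.

class ResultCache:
    # Toy query-result cache that evicts the entry with the lowest estimated benefit.
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # query -> [results, estimated_cost, hit_count]

    def get(self, query):
        entry = self.entries.get(query)
        if entry is None:
            return None          # cache miss: the caller must execute the query
        entry[2] += 1            # cache hit: record that the query stays popular
        return entry[0]

    def put(self, query, results, cost):
        if query not in self.entries and len(self.entries) >= self.capacity:
            # Evict the cached query that is asked least often and would be
            # cheapest to recompute, i.e. the entry with the smallest benefit.
            victim = min(self.entries, key=lambda q: self.entries[q][1] * self.entries[q][2])
            del self.entries[victim]
        self.entries[query] = [results, cost, 1]

# Hypothetical usage with made-up per-query costs (e.g. milliseconds to execute).
cache = ResultCache(capacity=2)
cache.put("web mining", [2], cost=5.0)
cache.put("pagerank", [1, 3], cost=1.0)
cache.get("web mining")                      # hit; boosts this entry's benefit
cache.put("spam detection", [4], cost=3.0)   # evicts "pagerank", the lowest-benefit entry

Weighting entries by an estimated re-execution cost, rather than counting hits alone, is what makes the eviction policy sensitive to the differing query costs that the abstract emphasizes.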
Keywords
Web spam detection technique, search engine performance, Web graph, Web user, real Web domain, large Web search engine, search engine, Web search engine, query processing, Web page, World Wide Web