Index compression and redundancy elimination in large textual collections

Index compression and redundancy elimination in large textual collections (2010)

Abstract
Large search engines process thousands of queries per second against collections of billions of web pages. They typically build inverted indexes over these collections to speed up query processing. The rapid growth in inverted index size has been one of the most important challenges for search engines over the past decade. Search engines use highly optimized compression schemes to reduce inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. Although search engines must download millions of new web pages every day, a considerable proportion of those pages share a lot of content. This results in a large amount of redundancy in both the web pages and the inverted indexes, and redundancy in the inverted indexes may significantly slow down query processing. Although many index compression methods have tried to reduce redundancy within pages and postings, they could be improved significantly by taking better advantage of similarities between web pages. Also, most previous work has focused on compressing the docID and frequency information stored in the index. However, it is also very important to compress the position information in the index, since its size is much larger than that of docIDs or frequencies. In this thesis, we focus on inverted index compression and query processing techniques. We study compression techniques for docIDs and frequencies combined with optimized document reordering techniques, which exploit the similarities between web pages. We also study the compression of position data in inverted indexes. In addition, we study file synchronization techniques that reduce redundant data transfer over networks; search engines can use such techniques to save a large amount of network bandwidth. Our experimental results show that our techniques can significantly improve search engine performance.
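To make the link between document reordering and docID compression concrete, here is a minimal sketch. It assumes a common textbook baseline (gap encoding followed by variable-byte coding), not the specific schemes studied in the thesis. The function names `to_gaps`, `from_gaps`, `varbyte_encode`, and `varbyte_decode` are hypothetical and chosen only for illustration.

```python
from typing import List


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a strictly increasing docID list into gaps (first gap is relative to 0)."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps: List[int]) -> List[int]:
    """Invert to_gaps by taking a running sum."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def varbyte_encode(numbers: List[int]) -> bytes:
    """Variable-byte code: 7 payload bits per byte, low-order group first.

    The high bit is set on the last byte of each number (one common
    convention; other implementations flip it).
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("variable-byte coding needs non-negative integers")
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk[-1] |= 0x80  # mark the final byte of this number
        out.extend(chunk)
    return bytes(out)


def varbyte_decode(data: bytes) -> List[int]:
    """Decode a byte string produced by varbyte_encode."""
    numbers, value, shift = [], 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:  # last byte of the current number
            numbers.append(value)
            value, shift = 0, 0
        else:
            shift += 7
    return numbers


# Illustrative (made-up) postings for one term under two docID assignments.
scattered = [3, 1_000_000, 2_000_000, 3_000_000]  # similar pages far apart
clustered = [3, 4, 5, 6]                          # similar pages numbered consecutively

scattered_bytes = varbyte_encode(to_gaps(scattered))
clustered_bytes = varbyte_encode(to_gaps(clustered))
print(len(scattered_bytes), len(clustered_bytes))
```

When similar pages receive nearby docIDs, the gaps in a posting list become small and each one fits in fewer bytes, which is the intuition behind reordering documents before compressing. The postings above are invented for illustration and are not data from the thesis.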
Keywords
redundancy elimination, compression technique, inverted index size, large textual collection, web page, index compression method, inverted index compression, search engine, index compression technique, query processing, large search engines process, inverted index