Indexing and querying over versioned text (2013)

Abstract
In recent years, the world-wide web has become a universal repository for storing human knowledge and culture. Any individual or company can create a web page, share ideas, upload pictures, and publish videos. This popularity, however, creates problems of its own: it is becoming more and more difficult for web users to find useful information among the billions of pages on the web. Web search engines were designed as a solution to this problem. Large web search engines now have to process thousands of queries per second over tens of billions of documents, resulting in very significant hardware and energy costs. Query processing algorithms in these engines are based on inverted index structures, and a large amount of research over the last decade has focused on how to better organize, compress, and access such indexes. This has contributed to significant increases in algorithmic efficiency that, together with increases in CPU speeds and counts, have allowed the major engines to keep up with increasing user demands.
In this thesis, we focus on versioned document collections, in which each document is represented by multiple versions. Important examples are Wikipedia, where the full edit history of every page is stored, the Internet Archive (IA), and source code and documents maintained in revision control systems. Versioned document collections can become very large, due to the need to retain past versions, but there is also a lot of redundancy between versions that can be exploited. In addition, such collections may cover a very long time period. Thus, versioned document collections are usually stored using special differential (delta) compression techniques. We study how to create highly compressed full-text index structures and perform efficient query processing for versioned document collections.
In particular, we propose a framework for indexing and querying versioned document collections that enables fast top-k query processing. Within this framework, we propose new index compression techniques for both non-positional and positional index structures for versioned document collections. Experimental results show that these techniques not only significantly reduce index size over previous methods, but also achieve faster top-k query processing. We also study how to support temporal range queries in versioned document collections. Such search queries over versioned document collections often use keywords as well as temporal constraints, most commonly a time range of interest. We show how to achieve high query throughput by using smart index partitioning techniques that take index compression into account. Experiments on 85 million document versions show that queries can be executed in less than 5 milliseconds on memory-based index structures, and in only slightly more time on disk-based structures. We also show how to efficiently support the recently proposed stable top-k search primitive on top of our schemes.
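To make the idea of a keyword query with a temporal range more concrete, the following is a minimal Python sketch. It is not the index layout, partitioning scheme, or compression method of the thesis; it only illustrates one simple way to exploit redundancy between versions, namely storing one posting per run of consecutive versions that contain a term. The names `Version`, `build_index`, and `query`, the integer timestamps, and the assumption that versions of a document cover contiguous half-open intervals are all hypothetical choices made for this illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Version(NamedTuple):
    doc_id: int
    start: int  # timestamp at which this version becomes valid (inclusive)
    end: int    # timestamp at which it is superseded (exclusive)
    text: str


def build_index(versions):
    """Map each term to postings (doc_id, start, end).

    Consecutive versions of the same document that both contain a term
    are merged into one posting, so unchanged terms are stored once.
    """
    index = defaultdict(list)
    for v in sorted(versions, key=lambda v: (v.doc_id, v.start)):
        for term in set(v.text.lower().split()):
            postings = index[term]
            last = postings[-1] if postings else None
            if last and last[0] == v.doc_id and last[2] == v.start:
                # Term survived the edit: extend the existing interval.
                postings[-1] = (v.doc_id, last[1], v.end)
            else:
                postings.append((v.doc_id, v.start, v.end))
    return index


def query(index, terms, t_start, t_end):
    """Return doc_ids for which all terms co-occur at some instant
    inside the half-open range [t_start, t_end)."""
    current = None  # doc_id -> intervals where all terms so far are present
    for term in terms:
        clipped = defaultdict(list)
        for doc_id, s, e in index.get(term.lower(), []):
            lo, hi = max(s, t_start), min(e, t_end)
            if lo < hi:
                clipped[doc_id].append((lo, hi))
        if current is None:
            current = clipped
            continue
        merged = defaultdict(list)
        for doc_id in current.keys() & clipped.keys():
            for a_lo, a_hi in current[doc_id]:
                for b_lo, b_hi in clipped[doc_id]:
                    lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
                    if lo < hi:
                        merged[doc_id].append((lo, hi))
        current = merged
    return sorted(current) if current else []


versions = [
    Version(1, 0, 10, "inverted index compression"),
    Version(1, 10, 20, "inverted index partitioning"),
    Version(2, 5, 15, "delta compression"),
    Version(2, 15, 30, "delta compression index"),
]
index = build_index(versions)
print(index["index"])                                   # [(1, 0, 20), (2, 15, 30)]
print(query(index, ["index", "compression"], 0, 30))    # [1, 2]
print(query(index, ["index", "compression"], 12, 18))   # [2]
```

In this sketch the term "index" needs only one posting for document 1 even though it appears in two versions, which is the kind of redundancy a versioned index can exploit. A production system would additionally compress the postings and partition them by time, which this illustration does not attempt.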
Keywords
versioned text, top-k query processing, new index compression technique, inverted index structure, full-text index structure, index size, memory-based index structure, versioned document collection, index compression, large web search engine, million document version