A Practical Implementation of Compressed Suffix Arrays with Applications to Self-Indexing

DCC (2014)

Abstract
In this paper we develop a simple and practical text indexing scheme for compressed suffix arrays (CSA). For a text of n characters, our CSA can be constructed in linear time and needs 2nH_k + n + o(n) bits of space for any k ≤ c log_σ n − 1 and any constant c < 1, where H_k denotes the kth order entropy. We compare the performance of our method with two established compressed indexing methods, the FM-index and the Sad-CSA. Experiments on the Canterbury Corpus and the Pizza&Chili Corpus show significant advantages of our algorithm over two other indexes in terms of compression and query time. Our storage scheme achieves better performance on all types of data present in these two corpora, except for evenly distributed data, such as DNA. The source code for our CSA is available online.
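
To make the space bound concrete, the sketch below computes the kth order empirical entropy H_k of a string and the leading terms 2nH_k + n of the bound quoted in the abstract. It is an illustration only and is not taken from the paper's source code: it assumes the standard definition of H_k (the size-weighted average of the zeroth-order entropy of the symbols that follow each length-k context), and the function names are hypothetical. The o(n) term is ignored, and a toy input says nothing about asymptotic behaviour.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy_h0(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def empirical_entropy_hk(text, k):
    """k-th order empirical entropy H_k of text, in bits per symbol.

    Assumes the standard definition: group each symbol by the k symbols
    that precede it, take the zeroth-order entropy of every group, and
    average those entropies weighted by group size over n = len(text).
    """
    n = len(text)
    if n == 0:
        return 0.0
    if k == 0:
        return empirical_entropy_h0(text)
    followers = defaultdict(list)
    for i in range(k, n):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * empirical_entropy_h0(f) for f in followers.values()) / n


def csa_space_leading_terms_bits(text, k):
    """Leading terms 2*n*H_k + n of the quoted bound (o(n) term omitted)."""
    n = len(text)
    return 2 * n * empirical_entropy_hk(text, k) + n


if __name__ == "__main__":
    sample = "abababab"
    for order in (0, 1):
        print(order,
              empirical_entropy_hk(sample, order),
              csa_space_leading_terms_bits(sample, order))
```

On a highly repetitive input such as the sample string, H_1 drops to zero because each symbol is fully determined by its predecessor, so the leading terms collapse to n bits. On near-uniform data H_k stays close to log σ, which matches the abstract's remark that evenly distributed data such as DNA benefits least.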
Keywords
Pizza&Chili Corpus, linear time, Sad-CSA, compressed indexing methods, data structures, distributed data, indexing, query time, computational complexity, Canterbury Corpus, FM-index, source code, DNA, text analysis, text indexing scheme, compressed suffix arrays, kth order entropy, decoding, distributed databases, encoding, entropy, FM index, indexes