Positional Inverted Self-Index

2016 Data Compression Conference (DCC)(2016)

Abstract
Summary form only given. We address the problem of positional indexing in the natural language domain. A positional inverted index contains the positions of every indexed word, so it can recover the original text file, which means the original file does not need to be stored. Our Positional Inverted Self-Index (PISI) stores the word position gaps encoded with variable byte code. The inverted lists of the single terms are combined into one inverted list that forms a backbone of the text file, since it stores the sequence of the indexed words of the original file. This inverted list is synchronized with a presentation layer that stores separators, stopwords, and variants of the indexed words. The presentation layer is encoded with Huffman coding. Our experiments show that PISI is not far from a standard positional inverted index in terms of search speed while consuming less memory. PISI also proved significantly faster than its close competitor FWCSA in terms of search speed at the same level of memory consumption.

PISI undergoes all the usual procedures during the construction phase. The indexed text is case folded (all letters are reduced to lower case), stopped (so-called stopwords are omitted), and stemmed (all words are reduced to their stems using the Porter stemming algorithm). PISI uses its presentation layer (proposed by Fariña et al. [1]) to store the information lost during these procedures. The presentation layer contains one (possibly empty) slot for every word of the inverted list. Each slot is composed of the Huffman codes of all non-alphanumeric words and all stopwords preceding the corresponding indexed word.

In the experimental part we compared three indexes: our PISI, the word-based self-index FWCSA proposed by Fariña et al. in [1], and a standard positional inverted index (II).
The fastest instance of PISI, with a compression ratio of 42.91 %, proved to be 23 times slower than II (with a snippet time of 2.37 × 10⁻⁸ seconds per extracted character for 1 000 extracted words). Furthermore, PISI usually achieved a snippet time an order of magnitude better than FWCSA at a comparable compression ratio. For example, PISI with a compression ratio of 42.91 % achieves a snippet time of 3.84 × 10⁻⁸ seconds per extracted character for 10 extracted words, whereas FWCSA with a compression ratio of 42.85 % achieves 1.25 × 10⁻⁶ seconds per extracted character for 10 extracted words. Finally, PISI achieves the best compression ratio among all tested algorithms, 39.73 %.
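
The abstract states that PISI stores word position gaps in variable byte code but gives no details of the byte layout. The following Python sketch illustrates the general idea only and is not the authors' implementation. It assumes one common convention (7 payload bits per byte, most significant group first, high bit set on the last byte of each number); the function names `vbyte_encode`, `encode_positions`, and `decode_positions` are hypothetical.

```python
from typing import Iterable, List


def vbyte_encode(n: int) -> bytes:
    """Encode one non-negative integer as variable byte code.

    Assumed convention: 7 payload bits per byte, most significant
    group first, high bit set only on the final byte of the number.
    """
    if n < 0:
        raise ValueError("variable byte code needs a non-negative integer")
    groups = [n % 128]
    n //= 128
    while n > 0:
        groups.append(n % 128)
        n //= 128
    groups.reverse()      # most significant 7-bit group first
    groups[-1] |= 0x80    # terminator flag on the last byte
    return bytes(groups)


def encode_positions(positions: Iterable[int]) -> bytes:
    """Turn ascending word positions into gaps and encode each gap."""
    out = bytearray()
    previous = 0
    for pos in positions:
        if pos < previous:
            raise ValueError("positions must be in ascending order")
        out += vbyte_encode(pos - previous)  # store the gap, not the position
        previous = pos
    return bytes(out)


def decode_positions(data: bytes) -> List[int]:
    """Decode the gaps and rebuild absolute positions by a running sum."""
    positions = []
    value = 0
    current = 0
    for byte in data:
        if byte < 128:
            value = value * 128 + byte           # continuation byte
        else:
            value = value * 128 + (byte - 128)   # final byte of this gap
            current += value
            positions.append(current)
            value = 0
    return positions


if __name__ == "__main__":
    word_positions = [3, 10, 140, 1000]
    encoded = encode_positions(word_positions)
    print(len(encoded), decode_positions(encoded))
```

Because small gaps fit in a single byte, frequent words (whose occurrences are close together) cost about one byte per occurrence under this convention, which is the usual motivation for gap encoding in inverted lists.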
Keywords
positional inverted self-index,natural language domain,inverted index,word positions,PISI,word position gaps,variable byte code,text file,indexed words,Huffman coding,search speed,memory consumption,construction phase,indexed text,Porter stemming algorithm,Huffman codes,non alphanumeric words,indexed word,compression ratio