Space-Efficient Substring Occurrence Estimation

MOD(2014)

Abstract
In this paper we study the problem of estimating the number of occurrences of substrings in textual data: a text T of length n over an alphabet Σ = [σ] is preprocessed and an index ℐ is built. The index is used in lieu of the text to answer queries of the form Count≈(P), which return an approximate number of occurrences of an arbitrary pattern P as a substring of T. The problem has its main application in selectivity estimation for the LIKE predicate in textual databases. Our focus is on obtaining an algorithmic solution with guaranteed error rates and a small footprint. To achieve that, we first enrich previous work in the area of compressed text indexing by providing an optimal data structure that, for a given additive error ℓ ≥ 1, requires Θ((n/ℓ) log σ) bits. We also address the issue of guaranteeing exact answers for sufficiently frequent patterns, providing a data structure whose size scales with the number of such patterns. Our theoretical findings are supported by experiments showing the practical impact of our data structures.
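To make the query semantics concrete, the following is a minimal toy sketch of a Count≈(P) query with a bounded additive error. It is not the data structure proposed in the paper: it assumes a naive pruning strategy in which exact counts are stored only for substrings that occur at least ℓ times and every other pattern is answered with 0, so the answer undercounts by less than ℓ. The function names (`build_pruned_index`, `count_approx`, `count_exact`) are hypothetical, and the construction enumerates all substrings, so it is only suitable for very short texts.

```python
from collections import Counter


def build_pruned_index(text: str, threshold: int) -> dict[str, int]:
    """Toy index (assumption, not the paper's structure): keep exact counts
    only for substrings occurring at least `threshold` times."""
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    return {s: c for s, c in counts.items() if c >= threshold}


def count_approx(index: dict[str, int], pattern: str) -> int:
    """Answer from the index alone; patterns that were pruned are reported as 0."""
    return index.get(pattern, 0)


def count_exact(text: str, pattern: str) -> int:
    """Reference answer: count (possibly overlapping) occurrences in the text."""
    return sum(
        text.startswith(pattern, i)
        for i in range(len(text) - len(pattern) + 1)
    )


if __name__ == "__main__":
    text = "abracadabra"
    ell = 2  # additive error bound
    index = build_pruned_index(text, ell)
    for pattern in ["a", "bra", "cad", "zzz"]:
        print(pattern, count_approx(index, pattern), count_exact(text, pattern))
```

In this sketch a frequent pattern is answered exactly, while a pattern with fewer than ℓ occurrences is reported as 0, so the true count always exceeds the estimate by less than ℓ. The toy index makes no attempt to meet the space bound stated in the abstract.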
Keywords
Compressed full-text indexing, Pattern matching, Full-text indexing, Data structures