Function-Assigned Masked Superstrings as a Versatile and Compact Data Type for 𝑘-Mer Sets

Ondřej Sladký, Pavel Veselý, Karel Břinda

bioRxiv (2024)

Abstract
The exponential growth of genome databases calls for novel space-efficient algorithms for data compression and search. State-of-the-art approaches often rely on 𝑘-merization for data tokenization, yet efficiently representing and querying 𝑘-mer sets remains a significant challenge in bioinformatics. Our recent work has introduced the concept of masked superstrings for compactly representing 𝑘-mer sets, designed without reliance on common structural assumptions on 𝑘-mer data. However, despite their compactness, the practicality of masked superstrings for set operations and membership queries was previously unclear. Here, we propose the 𝑓-masked superstring framework, which additionally integrates demasking functions 𝑓, enabling efficient 𝑘-mer set operations through concatenation. When combined with the FMS-index, a new index for 𝑓-masked superstrings based on a simplified FM-index, we obtain a versatile, compact data structure for 𝑘-mer sets. We demonstrate its power through the FMSI program, which, when evaluated on bacterial pan-genomic data, achieves memory savings of a factor of 3 to 10 compared to state-of-the-art single 𝑘-mer-set indexing methods such as SBWT and CBL. Our work presents a theoretical framework with promising practical advantages such as space-efficiency, demonstrating the potential of 𝑓-masked superstrings in 𝑘-mer-based methods as a generic data type.

Competing Interest Statement: The authors have declared no competing interest.
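
The idea of demasking functions enabling set operations through concatenation can be illustrated with a small sketch. It is a toy model under stated assumptions, not the FMSI implementation: the abstract does not name specific demasking functions, so "or" and "xor" are assumed here for illustration; reverse complements are ignored; and the helper names (`represented_kmers`, `naive_masked_superstring`) are hypothetical. A 𝑘-mer is taken to be in the set when 𝑓, applied to the mask bits at all of its occurrences, returns true.

```python
from collections import defaultdict

def represented_kmers(superstring, mask, k, f):
    """Return the k-mer set encoded by (superstring, mask) under demasking function f."""
    bits_by_kmer = defaultdict(list)
    for i in range(len(superstring) - k + 1):
        bits_by_kmer[superstring[i:i + k]].append(mask[i])
    return {kmer for kmer, bits in bits_by_kmer.items() if f(bits)}

# Assumed demasking functions (illustrative choices, not taken from the abstract).
def f_or(bits):
    return any(bits)            # represented if any occurrence is masked on

def f_xor(bits):
    return sum(bits) % 2 == 1   # represented if an odd number of occurrences are on

def naive_masked_superstring(kmer_set, k):
    """Toy construction: concatenate the k-mers and switch on only each k-mer's own start.

    This is deliberately not compact; it only guarantees that every k-mer of the
    set has exactly one 'on' occurrence and that the last k-1 mask bits are 0.
    """
    superstring = "".join(sorted(kmer_set))
    mask = []
    for _ in kmer_set:
        mask += [1] + [0] * (k - 1)
    return superstring, mask

k = 3
set_a = {"ACG", "CGT"}
set_b = {"CGT", "GTA"}
s_a, m_a = naive_masked_superstring(set_a, k)
s_b, m_b = naive_masked_superstring(set_b, k)

# Concatenate superstrings and masks; k-mers spanning the boundary start at
# positions whose mask bit is 0, so they do not enter the result.
s_ab, m_ab = s_a + s_b, m_a + m_b

union = represented_kmers(s_ab, m_ab, k, f_or)
sym_diff = represented_kmers(s_ab, m_ab, k, f_xor)
```

In this toy model, reading the same concatenated string with "or" yields the union of the two sets, while reading it with "xor" yields their symmetric difference, because a 𝑘-mer present in both inputs then has an even number of switched-on occurrences.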