A general minimal perfect hash function for canonical k-mers on arbitrary alphabets with an application to DNA sequences

bioRxiv (2023)

Abstract
To index or compare sequences efficiently, k-mers, i.e., substrings of fixed length k, are often used. In order to store them in a table, or to assign them to different tables or threads, k-mers are encoded as integers. One way to ensure an even distribution is to use minimal perfect hashing, i.e., a bijective mapping between all possible $\sigma^k$ k-mers and the interval $[0, \sigma^k - 1]$, where $\sigma$ is the alphabet size. In many applications, e.g., when the reading direction of a DNA sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval $[0, \sigma^k - 1]$, which hampers an even distribution to threads or tables. We present a minimal perfect hash function of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval $[0, \sigma^k - 1]$. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space- and time-efficient bit-based implementation for the DNA alphabet.

Competing Interest Statement: The authors have declared no competing interest.
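To make the terms in the abstract concrete, the following is a minimal Python sketch of a naive base-$\sigma$ encoding, canonicalization under reversal and under reverse complementation, and a brute-force minimal perfect hash built by enumeration. It is not the closed-form function the paper presents. All function names are hypothetical, and it assumes the alphabet string is given in lexicographic order (as "ACGT" is), so that Python's string comparison matches the alphabet order.

```python
from itertools import product

def encode(kmer, alphabet):
    """Naive base-sigma encoding: a bijection from all k-mers onto [0, sigma**k - 1]."""
    sigma = len(alphabet)
    rank = {c: i for i, c in enumerate(alphabet)}
    code = 0
    for c in kmer:
        code = code * sigma + rank[c]
    return code

def canonical(kmer):
    """Canonical form under reversal: the smaller of the k-mer and its reverse."""
    return min(kmer, kmer[::-1])

def canonical_rc(kmer, complement):
    """Canonical form under reverse complementation."""
    rc = "".join(complement[c] for c in reversed(kmer))
    return min(kmer, rc)

def canonical_kmers(alphabet, k):
    """All distinct canonical k-mers under reversal, in sorted order."""
    return sorted({canonical("".join(p)) for p in product(alphabet, repeat=k)})

def quarter_counts(alphabet, k):
    """Count canonical k-mers whose naive code falls in each quarter of [0, sigma**k - 1].

    Assumes sigma**k is divisible by 4 (true for the DNA alphabet).
    """
    width = len(alphabet) ** k // 4
    counts = [0, 0, 0, 0]
    for kmer in canonical_kmers(alphabet, k):
        counts[encode(kmer, alphabet) // width] += 1
    return counts

def brute_force_mphf(alphabet, k):
    """Illustration only: rank each canonical k-mer by enumeration.

    This yields a bijection onto [0, N - 1], where N is the number of
    canonical k-mers, but needs a table of size N instead of a formula.
    """
    return {kmer: i for i, kmer in enumerate(canonical_kmers(alphabet, k))}

if __name__ == "__main__":
    dna = "ACGT"
    print(len(canonical_kmers(dna, 3)))  # 40 canonical 3-mers under reversal
    print(quarter_counts(dna, 3))        # [16, 12, 8, 4]
```

For the DNA alphabet and k = 3, the naive codes of the 40 canonical k-mers under reversal fall into the four quarters of the code range with counts 16, 12, 8 and 4, which illustrates the uneven distribution the abstract refers to. The brute-force table removes the skew but costs memory proportional to the number of canonical k-mers.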