Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests
CoRR (2024)
Abstract
For taxonomic classification, we are asked to index the genomes in a
phylogenetic tree such that later, given a DNA read, we can quickly choose a
small subtree likely to contain the genome from which that read was drawn.
Although popular classifiers such as Kraken use k-mers, recent research
indicates that using maximal exact matches (MEMs) can lead to better
classifications. For example, we can build an augmented FM-index over the
genomes in the tree concatenated in left-to-right order; for each MEM in a
read, find the interval in the suffix array containing the starting positions
of that MEM's occurrences in those genomes; find the minimum and maximum values
stored in that interval; and take the lowest common ancestor (LCA) of the genomes
containing the characters at those positions. This solution is practical,
however, only when the total size of the genomes in the tree is fairly small.
In this paper we consider applying the same solution to three lossily
compressed representations of the genomes' concatenation: a KATKA kernel, which
discards characters that are not in the first or last occurrence of any
k_max-tuple, for a parameter k_max; a minimizer digest; a KATKA
kernel of a minimizer digest. With a test dataset and these three
representations of it, simulated reads and various parameter settings, we
checked how many reads' longest MEMs occurred only in the sequences from which
those reads were generated (“true positive” reads). For some parameter
settings we achieved significant compression while only slightly decreasing the
true-positive rate.
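
The MEM lookup described above (suffix-array interval, minimum and maximum positions, LCA) can be illustrated with a minimal sketch. It is not the authors' implementation: a naively sorted suffix array stands in for the augmented FM-index, the MEM is assumed to have been found already, and the names build_index, sa_interval, lca and classify_mem, the '$' separator, the toy genomes and the tree are all hypothetical choices made for illustration.

```python
from bisect import bisect_right

def build_index(genomes):
    # Concatenate the genomes in left-to-right leaf order; '$' is an
    # assumed separator that never occurs in a DNA read.
    text = "$".join(genomes) + "$"
    starts, pos = [], 0
    for g in genomes:
        starts.append(pos)          # start offset of each genome in text
        pos += len(g) + 1
    # Naive suffix array (a stand-in for the FM-index; fine for toy inputs only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, sa

def sa_interval(text, sa, pattern):
    # Half-open interval [lo, hi) of suffixes that start with pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo

def lca(parent, a, b):
    # parent maps each node to its parent; the root maps to None.
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent[a]
    while b not in ancestors:
        b = parent[b]
    return b

def classify_mem(text, starts, sa, leaves, parent, mem):
    lo, hi = sa_interval(text, sa, mem)
    if lo == hi:
        return None                 # the MEM does not occur in any genome
    positions = sa[lo:hi]
    # Map the smallest and largest starting positions to genome indices.
    first = bisect_right(starts, min(positions)) - 1
    last = bisect_right(starts, max(positions)) - 1
    return lca(parent, leaves[first], leaves[last])

# Toy example: three genomes; g0 and g1 share the internal node n01.
genomes = ["ACGTACGT", "ACGTTTGA", "GGGACCCA"]
leaves = ["g0", "g1", "g2"]
parent = {"g0": "n01", "g1": "n01", "n01": "root", "g2": "root", "root": None}
text, starts, sa = build_index(genomes)
result = classify_mem(text, starts, sa, leaves, parent, "ACGT")
```

Because the genomes are concatenated in the left-to-right order of the leaves, the LCA of the genomes holding the leftmost and rightmost occurrences is also the LCA of every genome in which the MEM occurs, which is why only the minimum and maximum of the interval are needed. In this toy example "ACGT" occurs in g0 and g1 only, so result is their parent n01.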