An Efficient Parallel Sketch-based Algorithm for Mapping Long Reads to Contigs

IPDPS Workshops (2023)

Abstract
Long read technologies continue to evolve rapidly, with the latest high-fidelity technologies delivering reads longer than 10 Kbp at high accuracy (99.9%). At the same time, partially constructed assemblies built from short read data already exist. Hybrid assembly workflows combine the information in both data sources to generate highly improved, near-complete assemblies and genomic scaffolds. In this paper, we address the problem of mapping long reads to contigs (which represent previously constructed partial assemblies). This is a many-to-many comparison problem, and brute-force comparison of all pairs is not practical. We therefore present a parallel, alignment-free, sketching-based algorithm that efficiently maps long reads to contigs. More specifically, our approach uses as its sketch a minimizer-based Jaccard estimator (JEM), a variant of the classical MinHashing technique. Experimental evaluation shows that our parallel algorithm produces a high-quality mapping while significantly improving the time to solution compared to state-of-the-art mapping tools. For instance, for the large genome of Betta splendens (approximately 350 Mbp) with 429K HiFi long reads and 98K contigs, our JEM approach produces a mapping with 99.31% precision and 96.18% recall, while yielding a 7.13x speedup over a state-of-the-art mapper (Mashmap).
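To make the idea of a minimizer-based Jaccard sketch concrete, the following is a minimal serial sketch in Python. It is not the paper's parallel JEM implementation. The function names (`kmer_hash`, `minimizer_sketch`, `jaccard_estimate`, `map_reads`), the default values of `k` and `w`, the choice of hash function, and the inverted index used to avoid scoring all read-contig pairs are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Optional, Set


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (built-in hash() is salted per process)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq: str, k: int = 5, w: int = 4) -> Set[int]:
    """Set of window minimizers: the smallest k-mer hash in each run of w consecutive k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    if not hashes:
        return set()  # sequence shorter than k
    n_windows = max(1, len(hashes) - w + 1)
    return {min(hashes[i:i + w]) for i in range(n_windows)}


def jaccard_estimate(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two minimizer sets, used as a proxy for sequence similarity."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def map_reads(reads: Dict[str, str], contigs: Dict[str, str],
              k: int = 5, w: int = 4) -> Dict[str, Optional[str]]:
    """Map each read to the contig with the highest estimated Jaccard score, or None."""
    sketches = {name: minimizer_sketch(seq, k, w) for name, seq in contigs.items()}

    # Inverted index (an assumption of this sketch): minimizer hash -> contigs containing it.
    index = defaultdict(set)
    for name, sketch in sketches.items():
        for h in sketch:
            index[h].add(name)

    mapping: Dict[str, Optional[str]] = {}
    for read_name, read_seq in reads.items():
        read_sketch = minimizer_sketch(read_seq, k, w)
        # Only contigs sharing at least one minimizer with the read are scored.
        candidates = set()
        for h in read_sketch:
            candidates |= index.get(h, set())
        best, best_score = None, 0.0
        for name in candidates:
            score = jaccard_estimate(read_sketch, sketches[name])
            if score > best_score:
                best, best_score = name, score
        mapping[read_name] = best
    return mapping
```

Calling `map_reads(reads, contigs)` with two dictionaries of name-to-sequence strings returns one contig name (or `None`) per read. Real long reads would call for much larger `k` and `w`, and a production mapper would also need to handle reverse complements and reads far shorter than the contig, both of which this sketch ignores.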
Keywords
hybrid assembly,long read mapping,sketching,MinHashing,parallel algorithms