An Efficient Parallel Sketch-based Algorithmic Workflow for Mapping Long Reads

Tazin Rahman, Oieswarya Bhowmik, Ananth Kalyanaraman

bioRxiv (2023)

Abstract
Long read technologies are continuing to evolve at a rapid pace, with the latest of the high fidelity technologies delivering reads over 10 Kbp with high accuracy (99.9%). Classical long read assemblers produce assemblies directly from long reads. Hybrid assembly workflows provide a way to combine partially constructed assemblies (or contigs) with newly sequenced long reads in order to generate improved and near-complete genomic scaffolds. Under either setting, the main computational bottleneck is the step of mapping the long reads, either against other long reads or against pre-constructed contigs. While many tools implement the mapping step through alignments and overlap computations, alignment-free approaches have the benefit of scaling in performance. Designing a scalable alignment-free mapping tool while maintaining the accuracy of mapping (precision and recall) is a significant challenge. In this paper, we visit the generic problem of mapping long reads to a database of subject sequences, in a fast and accurate manner. More specifically, we present an efficient parallel algorithmic workflow, called JEM-mapper, that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads. For implementation and evaluation, we consider two application settings: (i) the hybrid scaffolding setting, where the goal is to map a large collection of long reads to a large collection of partially constructed assemblies or contigs; and (ii) the classical long read assembly setting, where the goal is to map long reads to one another to identify overlapping long reads. Our algorithms and implementations are designed for execution on distributed memory parallel machines. Experimental evaluation shows that our parallel algorithm is highly effective in producing high-quality mapping while significantly improving the time to solution compared to state-of-the-art mapping tools.
For instance, in the hybrid setting for a large genome Betta splendens (350 Mbp genome) with 429K HiFi long reads and 98K contigs, JEM-mapper produces a mapping with 99.41% precision and 97.91% recall, while yielding 6.9x speedup over a state-of-the-art mapper.

Competing Interest Statement: The authors have declared no competing interest.
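The abstract names a minimizer-based Jaccard estimator sketch but does not spell out its construction. As a rough illustration of the general idea only, the Python sketch below combines two textbook techniques: window minimizers and a MinHash-style Jaccard estimate computed over the minimizer sets. It is not the JEM sketch or the JEM-mapper implementation; the function names, the parameters `k`, `w`, and `trials`, and the choice of BLAKE2b as the hash are all assumptions made for this example.

```python
import hashlib


def kmer_hash(kmer: str, seed: int) -> int:
    """Hash a k-mer to a 64-bit integer; the seed selects one hash function of a family."""
    digest = hashlib.blake2b(f"{seed}:{kmer}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def minimizers(seq: str, k: int, w: int) -> set:
    """Return the set of window minimizers of seq.

    A window is w consecutive k-mers; its minimizer is the
    lexicographically smallest k-mer in that window.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def sketch(seq: str, k: int = 5, w: int = 4, trials: int = 64) -> list:
    """MinHash-style sketch: for each trial, keep the smallest hash over all minimizers."""
    mins = minimizers(seq, k, w)
    if not mins:
        return []
    return [min(kmer_hash(m, t) for m in mins) for t in range(trials)]


def jaccard_estimate(a: list, b: list) -> float:
    """Estimate Jaccard similarity as the fraction of trials whose minimum hashes agree."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Hypothetical toy sequences: a contig and a read drawn from inside it.
    contig = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
    read = "TGCATGCCGATAGCTAGGCTAACG"
    print(jaccard_estimate(sketch(read), sketch(contig)))
```

In a mapping setting, a read would be assigned to the subject sequence whose sketch gives the highest estimate. The appeal of this kind of scheme is that each comparison touches only `trials` integers rather than the full sequences, which is what lets alignment-free mapping scale.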