Kmerind: A Flexible Parallel Library For K-Mer Indexing Of Biological Sequences On Distributed Memory Systems

BCB(2016)

Abstract
Counting and indexing fixed-length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks, including genome alignment and mapping, genome assembly, and error correction. While advances in next-generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, there exist few bioinformatics tools and libraries that can efficiently process the data sets at the current generation rate of 1.8 terabases every 3 days. We present Kmerind, a high-performance k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Using Kmerind, a user can easily instantiate application-specific indices, such as a k-mer counter or a position index, from built-in or user-supplied components without extensive high-performance computing expertise. Kmerind's k-mer counter performs comparably to or better than existing best-in-class k-mer counting tools, even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read data set in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1% of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. To our knowledge, Kmerind is the first k-mer indexing library for distributed memory environments, and the first fully customizable and extensible library for general k-mer indexing and counting. Kmerind is available from https://github.com/ParBLiSS/kmerind.
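For readers unfamiliar with the two index types named in the abstract, the following is a minimal sequential sketch of a k-mer counter and a k-mer position index. It is not Kmerind's API or implementation: the function names `count_kmers` and `index_kmer_positions` are hypothetical, and the sketch ignores parallelism, distribution across nodes, and compact k-mer encoding.

```python
from collections import Counter, defaultdict


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of `sequence`.

    Hypothetical helper for illustration only; not part of Kmerind.
    """
    counts = Counter()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    for start in range(len(sequence) - k + 1):
        counts[sequence[start:start + k]] += 1
    return counts


def index_kmer_positions(sequence: str, k: int) -> dict:
    """Map each k-mer to the list of 0-based offsets where it occurs.

    Hypothetical helper for illustration only; not part of Kmerind.
    """
    positions = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    return dict(positions)


if __name__ == "__main__":
    read = "ACGTACGT"
    print(count_kmers(read, 3))
    print(index_kmer_positions(read, 3))
```

A count index stores one integer per distinct k-mer, whereas a position index stores one entry per occurrence, so the position index is the larger of the two for the same input.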
Keywords
k-mer counting, k-mer index, next-generation sequencing, distributed computing, parallel computing, MPI, SIMD