Efficient Distributed Smith-Waterman Algorithm Based on Apache Spark

2017 IEEE 10th International Conference on Cloud Computing (CLOUD) (2017)

Abstract
The Smith-Waterman (SW) algorithm, which produces the optimal local alignment between a pair of sequences, is widely used as a key component in bioinformatics. It is more sensitive than heuristic approaches, but also more time-consuming. To speed it up, Single-Instruction Multiple-Data (SIMD) instructions have been used to parallelize the algorithm by exploiting data parallelism. However, SIMD-based SW algorithms show limited scalability. Moreover, recent next-generation sequencing machines generate sequences at an unprecedented rate, so faster implementations of sequence alignment algorithms are needed to keep pace. In this paper, we present CloudSW, an efficient distributed Smith-Waterman algorithm that leverages Apache Spark and SIMD instructions to accelerate the computation. To facilitate easy integration of the distributed Smith-Waterman algorithm into third-party software, we provide application programming interfaces (APIs) as a cloud service. The experimental results demonstrate that 1) CloudSW has outstanding performance, achieving up to 3.29 times speedup over DSW and 621 times speedup over SparkSW; and 2) CloudSW has excellent scalability, achieving up to 529 giga cell updates per second (GCUPS) in protein database search with 50 nodes in Aliyun Cloud, which is, to our knowledge, the highest performance reported to date.
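For readers unfamiliar with the terms in the abstract, the following is a minimal Python sketch of the textbook Smith-Waterman local alignment score and of the GCUPS metric. It is not CloudSW's implementation: the function names `sw_score` and `gcups`, the match/mismatch/gap values, and the linear gap penalty are all assumptions chosen for illustration. Protein database search would typically use a substitution matrix and affine gap penalties, and the SIMD and Spark layers described in the abstract are not shown.

```python
def sw_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of two sequences.

    Illustrative scoring only (assumed values): fixed match/mismatch
    scores and a linear gap penalty. Keeps two rows of the dynamic
    programming matrix, so memory is O(len(target)).
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # 0 restarts the alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def gcups(query_len, db_residues, seconds):
    """Giga cell updates per second: DP cells computed / time / 1e9."""
    return query_len * db_residues / seconds / 1e9


if __name__ == "__main__":
    print(sw_score("GGACGTCC", "TTACGTAA"))
    print(gcups(1000, 2_000_000, 2.0))
```

Each matrix cell is one "cell update", so aligning a query of length m against a database with n total residues costs m × n updates. That product is the numerator of the GCUPS figure quoted in the abstract.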
Keywords
Distributed Smith–Waterman algorithm, SIMD instructions, Apache Spark, Scalability, Alluxio, HDFS