FQCSpark: Efficient Spark-based Parallel Compression Algorithm for FASTQ Genome Sequences

International Conference on Computer Supported Cooperative Work in Design (CSCWD) (2022)

Abstract
The rapid development of Next-Generation Sequencing (NGS) technologies poses serious challenges for the storage and transmission of genomic data, and the bioinformatics community urgently needs efficient genome compression algorithms to support genome analysis. Existing approaches to accelerating genome compression mostly rely on multithreading and are limited to a single machine, so they cannot meet the demand for large-scale genome compression in the distributed environment of cloud computing. In this paper, we propose FQCSpark, an efficient Spark-based parallel compression algorithm for FASTQ genome sequences. Experimental results show that FQCSpark is several times faster than existing algorithms that achieve good compression ratios. The speedup comes from a fine-grained parallelism design and a well-designed flow of parallel operators. More importantly, the parallelism design in this paper also applies to other algorithms that compress blocks independently. FQCSpark also provides good compression ratios, especially on the S.cerevisiae dataset, where it is 15.4% better than Genozip, the latest open-source genome compression tool. FQCSpark is the first known Spark-based parallel compression algorithm for FASTQ genome sequences.
Keywords
genome compression, FASTQ, Spark, parallel computing, distributed computing
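
The abstract notes that the parallelism design carries over to any algorithm that compresses blocks independently. The sketch below illustrates only that general idea and is not FQCSpark itself: a local thread pool stands in for Spark executors, zlib stands in for a genome-specific codec, and every name in it (split_into_blocks, compress_block, compress_parallel, decompress, SAMPLE) is hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

RECORD_LINES = 4  # a FASTQ record spans four lines: header, bases, '+', qualities


def split_into_blocks(fastq_text, records_per_block):
    """Group whole FASTQ records into blocks so no record straddles two blocks."""
    lines = fastq_text.splitlines(keepends=True)
    step = RECORD_LINES * records_per_block
    return ["".join(lines[i:i + step]) for i in range(0, len(lines), step)]


def compress_block(block):
    """Compress one block; no state is shared with any other block."""
    # Assumption: zlib is a placeholder for the real per-block codec.
    return zlib.compress(block.encode("ascii"), 9)


def compress_parallel(fastq_text, records_per_block=2, workers=4):
    """Compress all blocks concurrently (thread pool as a stand-in for a cluster)."""
    blocks = split_into_blocks(fastq_text, records_per_block)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so blocks reassemble directly.
        return list(pool.map(compress_block, blocks))


def decompress(compressed_blocks):
    """Decompress each block on its own and concatenate in the original order."""
    return "".join(zlib.decompress(b).decode("ascii") for b in compressed_blocks)


# Tiny synthetic input: five made-up reads, not real sequencing data.
SAMPLE = "".join(
    f"@read{i}\nACGTACGTAC\n+\nIIIIIIIIII\n" for i in range(5)
)
```

Because each block is self-contained, the number of blocks rather than the number of threads on one machine bounds the available parallelism, and any single block can be decompressed without touching the others.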