Pipelined Multi-FPGA Genomic Data Clustering.

Rick Wertenbroek, Enrico Petraglio, Yann Thoma

ICA3PP (2017)

Abstract

High-throughput DNA sequencing has made individual genome profiling possible and produces very large amounts of data. Today, the data and associated metadata are stored in FASTQ text file assemblies carrying the information of genome fragments called reads. Current techniques rely on mapping these reads to a common reference genome for compression and analysis. However, about 10% of the reads do not map to any known reference, making them difficult to compress or process. These reads are of high importance because they hold information absent from any reference. Finding overlaps in these reads can help subsequent processing and compression tasks tremendously. Within this context, clustering is used to find overlapping unmapped reads and sort them into groups. Because clustering is an extremely time-consuming task, a modular multi-FPGA pipeline was designed and is the focus of this paper. A pipeline with 6 FPGAs was created and has shown a speed-up of 5× compared to existing FPGA implementations. The resulting enriched files, which encode reads and clustering results, show file sizes within a 10% margin of the best DNA compressors while providing valuable extra information.

Keywords

FPGA, Acceleration, Genomic data, Clustering, Compression
