Accelerating Comparative Genomics Workflows In A Distributed Environment With Optimized Data Partitioning And Workflow Fusion

Scalable Computing: Practice and Experience(2015)

Abstract
The advent of next-generation sequencing technology has generated massive amounts of biological data at unprecedented rates. Comparative genomics applications often require compute-intensive tools for subsequent analysis of high-throughput data. Although cloud computing infrastructure plays an important role in this respect, the pressure from such computationally expensive tasks can be further alleviated using efficient data partitioning and workflow fusion. Here, we implement a workflow-based model for parallelizing the data-intensive tasks of genome alignment and variant calling with BWA and GATK's HaplotypeCaller. We explore three approaches to partitioning data (granularity-based, individual-based, and alignment-based) and how each affects the run time. We observe granularity-based partitioning for BWA and alignment-based partitioning for HaplotypeCaller to be the optimal choices for the pipeline. We further discuss the methods and impact of workflow fusion on performance by considering different levels of fusion and how each affects our results. We identify the various open problems encountered, such as understanding the extent of parallelism, using heterogeneous environments without a shared file system, and determining the granularity of inputs, and provide insights into addressing them. Finally, we report significant performance improvements, from 12 days to under 2 hours, when running the BWA-GATK pipeline using partitioning and fusion.
Keywords
genome alignment,variant calling,workflow fusion,data partitioning,performance
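
The abstract does not define granularity-based partitioning. One plausible reading, offered here only as an assumption, is that the input reads are split into chunks of a fixed number of reads so that each chunk can be aligned by a separate BWA task. The sketch below illustrates that idea on FASTQ text. The function name `partition_fastq` and the parameter `reads_per_chunk` are hypothetical, and the code assumes the common four-line FASTQ layout with unwrapped sequences.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Assumption: each FASTQ record spans exactly four lines
# (header, sequence, '+' separator, quality string).
LINES_PER_READ = 4


def partition_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield consecutive chunks holding at most `reads_per_chunk` reads each.

    `reads_per_chunk` plays the role of the partition granularity; it is a
    hypothetical parameter for illustration, not one taken from the paper.
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    chunk_len = reads_per_chunk * LINES_PER_READ
    while True:
        chunk = list(islice(it, chunk_len))
        if not chunk:
            return
        if len(chunk) % LINES_PER_READ != 0:
            raise ValueError("input ends in the middle of a FASTQ record")
        yield chunk


# Five toy reads split at a granularity of two reads per chunk.
toy_reads: List[str] = []
for i in range(5):
    toy_reads += [f"@read{i}", "ACGT", "+", "IIII"]

toy_chunks = list(partition_fastq(toy_reads, reads_per_chunk=2))
```

In a setup like this, each chunk would be written to its own file and passed to an independent alignment task. A smaller `reads_per_chunk` exposes more parallelism but creates more tasks to schedule, which is the kind of trade-off the abstract refers to when it mentions determining the granularity of inputs.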