MRSMRS: Mining repetitive sequences in a MapReduce setting

BIBM (2014)

Abstract
Recent research suggests that DNA repeats play critical roles in cellular regulatory functions and disease development. Repeat variability among different species, or within the same species, is also an important indicator for the development of specific phenotypes, and similarities in repetitive sequences among different species have been shown to indicate deeply conserved functions. Patterns such as ultraconserved elements (UCEs), tandem repeats, and palindromes have been of particular interest, and researchers use various computational approaches to identify each of these pattern types. The challenge in identifying repeats across a collection of genomes arises from the amount of data stored within DNA. The human genome alone consists of more than 3.1 billion base pairs, and the intermediate data generated by alignment- and hash-based approaches are substantial. This sort of all-against-all analysis on a large collection of genomic sequence data often requires data to be reprocessed when new genomes are collected. To handle data of this scale, we use the Hadoop Distributed File System running on a cluster of 11 relatively inexpensive nodes, each containing a quad-core commodity processor. Furthermore, to avoid redundant computation, intermediate data are organized in HBase, allowing us to process new genomic data incrementally without having to reprocess existing genomes. Our approach offers a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method using a collection of 6 genomes totaling approximately 14.2 billion base pairs. Three case studies are presented, demonstrating a 10.4-fold speedup over previous state-of-the-art approaches and linear scalability.
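The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: a hash-based map step, a grouping reduce step, and a persistent intermediate index that lets a new genome be added without reprocessing earlier ones. Everything here is an assumption for illustration: the function names (`map_kmers`, `reduce_into_index`, `add_genome`, `shared_repeats`) are hypothetical, fixed-length k-mers stand in for whatever keys the authors actually hash, and an in-memory dictionary stands in for HBase.

```python
from collections import defaultdict


def map_kmers(genome_id, sequence, k):
    """Map step (illustrative): emit (k-mer, (genome_id, position)) per window."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], (genome_id, i)


def reduce_into_index(index, pairs):
    """Reduce step (illustrative): group occurrences by k-mer in a shared index.

    The dict plays the role of the persistent intermediate store; it is an
    assumption standing in for an HBase table, not the authors' schema.
    """
    for kmer, hit in pairs:
        index[kmer].append(hit)
    return index


def add_genome(index, genome_id, sequence, k):
    """Incrementally add one genome; previously indexed genomes are not rescanned."""
    return reduce_into_index(index, map_kmers(genome_id, sequence, k))


def shared_repeats(index, min_genomes=2):
    """Return k-mers that occur in at least `min_genomes` distinct genomes."""
    return {
        kmer: hits
        for kmer, hits in index.items()
        if len({genome for genome, _ in hits}) >= min_genomes
    }


if __name__ == "__main__":
    index = defaultdict(list)
    add_genome(index, "A", "ACGTACGT", 4)
    add_genome(index, "B", "TTACGTAA", 4)  # added later, "A" is not reprocessed
    for kmer, hits in sorted(shared_repeats(index).items()):
        print(kmer, hits)
```

In a real MapReduce deployment the map and reduce steps would run in parallel across nodes and the index would live in a distributed store; the sketch only shows why keeping intermediate results makes the addition of a new genome cheap.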
Keywords
alignment-based approaches, human genome, specific phenotypes development, parallel processing, linear scalability, deeply conserved functions, cellular regulatory functions, mining repetitive sequences-in-a-mapreduce setting, tandem repeats, genomics, scalable solution, cluster, quad-core commodity processor, molecular biophysics, molecular configurations, hash-based approaches, genomic sequence data, mrsmrs, computational approaches, repetitive sequences, intermediate data generation, disease development, dna repeats, sequence analysis, data mining, redundant computation, big data, dna, palindromes, bioinformatics, hadoop distributed file system, repeat variability, ultraconserved elements, molecular clusters