Scalable Framework for the Analysis of Population Structure Using the Next Generation Sequencing Data.

Lecture Notes in Artificial Intelligence (2017)

Abstract
Genomic variant data obtained from next-generation sequencing can be used to study the population structure of the genotyped individuals. Typical approaches to ethnicity classification/clustering consist of several time-consuming pre-processing steps, such as variant filtering, LD-pruning and dimensionality reduction of the genotype matrix. We have developed a framework in the R programming language to analyze how various pre-processing methods and their parameters influence the final results of classification/clustering algorithms. The results indicate how to fine-tune the pre-processing steps in order to maximize supervised and unsupervised classification performance. In addition, to enable efficient processing of large data sets, we have developed a second framework using Apache Spark. Tests performed on the 1000 Genomes data set confirmed the efficiency and scalability of the presented approach. Finally, the dockerized version of the implemented frameworks (freely available at: https://github.com/ZSI-Bio/popgen) can be easily applied to any other variant data set, including data from large-scale sequencing projects or custom data sets from clinical laboratories.
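The first two pre-processing steps named in the abstract, variant filtering and LD-pruning, can be illustrated with a minimal sketch. This is not the authors' implementation (their frameworks are written in R and Apache Spark); it is a pure-Python illustration under stated assumptions. The function names, the genotype encoding (0/1/2 alternate-allele counts), the thresholds (`maf_threshold=0.05`, `r2_threshold=0.8`) and the toy data are all hypothetical. It also assumes that variant filtering means a minor-allele-frequency cut-off and that pruning compares each variant against every previously kept variant, whereas practical tools typically restrict the comparison to a sliding window.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency for one variant.

    genotypes: alternate-allele counts (0, 1 or 2) per individual.
    """
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return (cov * cov) / (var_x * var_y)


def preprocess(matrix, maf_threshold=0.05, r2_threshold=0.8):
    """Return indices of variants that survive filtering and pruning.

    matrix: one row per variant, one column per individual.
    Thresholds are illustrative defaults, not values from the paper.
    """
    kept = []
    for idx, row in enumerate(matrix):
        # Step 1: drop rare or monomorphic variants.
        if minor_allele_frequency(row) < maf_threshold:
            continue
        # Step 2: greedy LD-pruning against variants already kept.
        if any(r_squared(row, matrix[k]) > r2_threshold for k in kept):
            continue
        kept.append(idx)
    return kept


# Toy genotype matrix: 4 variants x 6 individuals (hypothetical data).
variants = [
    [0, 1, 2, 1, 0, 2],  # common variant
    [0, 1, 2, 1, 0, 2],  # identical to the first, so in perfect LD
    [0, 0, 0, 0, 0, 0],  # monomorphic, removed by the MAF filter
    [2, 0, 1, 0, 2, 1],  # common, weakly correlated with the first
]
kept = preprocess(variants)
print(kept)  # [0, 3]
```

The surviving rows would then form the reduced genotype matrix passed on to dimensionality reduction and to the classification/clustering step.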
Keywords
Adjusted Rand Index, Hadoop Distributed File System, Variant Call Format, Variant Call Format File, Distributed Computing Framework