mitoDataclean: A machine learning approach for the accurate identification of cross-contamination-derived tumor mitochondrial DNA mutations

INTERNATIONAL JOURNAL OF CANCER(2022)

引用 2|浏览26
暂无评分
摘要
Next-generation sequencing (NGS) of mitochondrial DNA (mtDNA) has widespread applications in aging and cancer studies. However, cross-contamination of mtDNA constitutes a major concern. Previous methods for the detection of mtDNA contamination mainly focus on haplogroup-level phylogeny, but neglect haplotype-level differences, leading to limited sensitivity and accuracy. In our study, we present mitoDataclean, a random-forest-based machine learning package for accurate identification of cross-contamination, evaluation of contamination levels and detection of contamination-derived variants in mtDNA NGS data. Comprehensive optimization of mitoDataclean revealed that training simulation with mixtures of small haplogroup distance and low polymorphic difference was critical for optimal modeling. Compared to existing methods, mitoDataclean exhibited significantly improved sensitivity and accuracy for the detection of sample contamination in simulated data. In addition, mitoDataclean achieved area under the curve values of 0.91 and 0.97 for discerning genuine and contamination-derived mtDNA variants in a simulated Western dataset and private sequencing contamination data, respectively, suggesting that this tool may be applicable for different populations and samples with different sources of contamination. Finally, mitoDataclean was further evaluated in several private and public datasets and showed a robust ability for contamination detection. Altogether, our study demonstrates that mitoDataclean may be used for accurate detection of contaminated samples and contamination-derived variants in mtDNA NGS data.
更多
查看译文
关键词
machine learning, mitochondrial DNA, next-generation sequencing, sample cross-contamination
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要