PASV: Automatic protein partitioning and validation using conserved residues

bioRxiv (2021)

Abstract
Background: Researchers in community and population ecology increasingly use protein-coding genes obtained from targeted PCR amplification or direct metagenomic sequencing. Analysis of protein-coding genes presents different challenges from those encountered in traditional SSU rRNA studies. Most protein-coding sequences are annotated based on homology to other computationally annotated sequences, which can lead to inaccurate annotations. The results of sensitive homology searches must therefore be validated to remove false positives and assess functionality. Multiple lines of in silico evidence can be gathered by examining conserved domains and residues identified through biochemical investigations. However, manually validating sequences in this way can be time-consuming and error-prone, especially in large environmental studies.

Results: An automated pipeline for protein active site validation (PASV) was developed to improve validation and partitioning accuracy for protein-coding sequences by combining multiple sequence alignment with expert domain knowledge. PASV was tested using commonly misannotated proteins: ribonucleotide reductase (RNR), alternative oxidase (AOX), and plastid terminal oxidase (PTOX). PASV partitioned 9,906 putative Class I alpha and Class II RNR sequences from bycatch in a global viral metagenomic investigation with >99% true positive and true negative rates. PASV predicted the class of 2,579 RNR sequences in >98% agreement with manual annotations. PASV correctly partitioned all 336 tested AOX and PTOX sequences.

Conclusions: PASV provides an automated and accurate way to address post-homology search validation and partitioning of protein-coding marker genes. Source code is released under the MIT license and is available, with documentation and usage examples, on GitHub.

### Competing Interest Statement

The authors have declared no competing interest.
* AOX: alternative oxidase
* BRL: branch length
* CDD: conserved domain database
* GOV: global ocean virome
* IQR: interquartile range
* LOESS: locally estimated scatterplot smoothing
* MSA: multiple sequence alignment
* NCBI: National Center for Biotechnology Information
* PASV: protein active site validation
* PFL: pyruvate formate lyase
* Pol I: DNA polymerase I
* PTOX: plastid terminal oxidase
* RNR: ribonucleotide reductase
* ROI: region of interest
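
The abstract describes validating a sequence by aligning it against references and inspecting conserved residues. The general idea can be illustrated with a minimal Python sketch. This is not PASV's actual implementation or interface: the function names, the toy aligned sequences, the key positions, and the expected residue signature are all hypothetical, and the sketch assumes the query has already been aligned to a reference by an external aligner.

```python
def column_for_reference_position(aligned_ref: str, pos: int) -> int:
    """Return the 0-based alignment column that holds the pos-th
    (1-based) ungapped residue of the reference sequence."""
    seen = 0
    for col, ch in enumerate(aligned_ref):
        if ch != "-":
            seen += 1
            if seen == pos:
                return col
    raise ValueError(f"reference has fewer than {pos} residues")


def residue_signature(aligned_ref: str, aligned_query: str,
                      key_positions: list[int]) -> str:
    """Collect the query residues found in the alignment columns that
    correspond to the given reference positions."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        aligned_query[column_for_reference_position(aligned_ref, p)]
        for p in key_positions
    )


def partition(signature: str, expected: str) -> str:
    """Label a query by whether its signature matches the expected residues."""
    return "pass" if signature == expected else "fail"


# Hypothetical toy data: a gapped reference and two gapped queries.
ALIGNED_REF = "MK-CDEYG"
KEY_POSITIONS = [3, 5, 6]   # assumed conserved positions in the reference
EXPECTED = "CEY"            # assumed residues at those positions

good_query = "MKACNEY-"
bad_query = "MKASNEY-"

good_sig = residue_signature(ALIGNED_REF, good_query, KEY_POSITIONS)
bad_sig = residue_signature(ALIGNED_REF, bad_query, KEY_POSITIONS)
print(good_sig, partition(good_sig, EXPECTED))
print(bad_sig, partition(bad_sig, EXPECTED))
```

Mapping reference positions to alignment columns first keeps the check independent of where the aligner inserted gaps. Grouping queries by their signature string, rather than by a single pass/fail label, is one way such a check could extend from validation to partitioning.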