PASV: Automatic protein partitioning and validation using conserved residues
bioRxiv (2021)
Abstract
Background Increasingly, researchers use protein-coding genes from targeted PCR amplification or direct metagenomic sequencing in community and population ecology. Analysis of protein-coding genes presents different challenges from those encountered in traditional SSU rRNA studies. Most protein-coding sequences are annotated based on homology to other computationally annotated sequences, which can lead to inaccurate annotations. Therefore, the results of sensitive homology searches must be validated to remove false positives and assess functionality. Multiple lines of in silico evidence can be gathered by examining conserved domains and residues identified through biochemical investigations. However, manually validating sequences in this way can be time-consuming and error-prone, especially in large environmental studies.
Results An automated pipeline for protein active site validation (PASV) was developed to improve validation and partitioning accuracy for protein-coding sequences, combining multiple sequence alignment with expert domain knowledge. PASV was tested using commonly misannotated proteins: ribonucleotide reductase (RNR), alternative oxidase (AOX), and plastid terminal oxidase (PTOX). PASV partitioned 9,906 putative Class I alpha and Class II RNR sequences from bycatch in a global viral metagenomic investigation with >99% true positive and true negative rates. PASV predicted the class of 2,579 RNR sequences in >98% agreement with manual annotations. PASV correctly partitioned all 336 tested AOX and PTOX sequences.
Conclusions PASV provides an automated and accurate way to address post-homology search validation and partitioning of protein-coding marker genes. Source code is released under the MIT license and is found with documentation and usage examples on GitHub at .
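The idea of validating and partitioning sequences by conserved residues can be illustrated with a minimal sketch. This is not PASV's implementation: the function names, the toy sequences, the chosen positions, and the class labels below are all hypothetical. The sketch assumes a query has already been aligned against a reference sequence whose key residue positions are known from biochemical studies. It maps each ungapped reference position to an alignment column, reads the query residue in that column, and compares the resulting signature with expected signatures.

```python
from typing import Dict, List


def column_for_reference_position(aligned_ref: str, position: int) -> int:
    """Return the 0-based alignment column that holds the given
    1-based position of the ungapped reference sequence."""
    seen = 0
    for column, residue in enumerate(aligned_ref):
        if residue != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"reference has no position {position}")


def residue_signature(aligned_ref: str, aligned_query: str,
                      positions: List[int]) -> str:
    """Collect the query residues aligned to the key reference positions."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        aligned_query[column_for_reference_position(aligned_ref, p)]
        for p in positions
    )


def partition(signature: str, expected: Dict[str, str]) -> str:
    """Return the label whose expected signature matches, else 'unassigned'."""
    for label, expected_signature in expected.items():
        if signature == expected_signature:
            return label
    return "unassigned"


# Toy data, purely illustrative (not real RNR, AOX, or PTOX sequences).
aligned_ref = "MK-CNEC"    # reference row of the alignment, with one gap
aligned_query = "MRACNQC"  # query row of the same alignment
key_positions = [3, 5]     # hypothetical key residues in the ungapped reference
expected = {"class_a": "CE", "class_b": "CQ"}  # hypothetical signatures

signature = residue_signature(aligned_ref, aligned_query, key_positions)
label = partition(signature, expected)
```

A query whose signature matches none of the expected patterns is left unassigned, which corresponds to flagging a likely false positive from the homology search for removal or closer inspection.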
### Competing Interest Statement
The authors have declared no competing interest.
* AOX: alternative oxidase
* BRL: branch length
* CDD: conserved domain database
* GOV: global ocean virome
* IQR: interquartile range
* LOESS: locally estimated scatterplot smoothing
* MSA: multiple sequence alignment
* NCBI: National Center for Biotechnology Information
* PASV: protein active site validation
* PFL: pyruvate formate lyase
* Pol I: DNA polymerase I
* PTOX: plastid terminal oxidase
* RNR: ribonucleotide reductase
* ROI: region of interest