Quality control and annotation of variant peptides identified through Proteogenomics

bioRxiv (2023)

Abstract
The importance of single nucleotide polymorphisms (SNPs) in disease is well known, but which of them are translated to produce single amino acid variants (SAVs, or variants) has not been studied in detail. Such information may have translational potential, but obtaining it requires integrated analysis of genomics and proteomics data. Identification of novel and variant peptides using proteogenomics has paved the way to understanding their unique and diverse phenotypic relationships in health and disease. Proteogenomic studies of cancer, neurological, and cardiovascular diseases have revealed many variants of clinical importance. However, false positives arising from large search databases are a major challenge in proteogenomics, and even at a strict false discovery rate (FDR) it is difficult to separate true from false variant hits. Some approaches have been suggested to address this problem, such as class-specific FDR estimation and filtering workflows for better sensitivity. Applying these methods and workflows to database search results requires advanced bioinformatics skills, and their computational complexity makes them not readily accessible to biologists. The goal of this study was to develop an accessible tool for quality control of variant peptides. We analyzed many descriptors pertaining to peptide-spectrum match (PSM) quality, variant events, and peptide matches to evaluate their ability to distinguish true variants from false positives in proteogenomics. These features were used to develop a variant ambiguity score (VAS), which was implemented in the tool PgxSAVy. PgxSAVy provides a framework to re-score proteogenomics variants resulting from single or multiple search algorithms and to classify SAVs by their quality of evidence. To evaluate VAS, we tested it on simulated data containing true and false SAVs and observed that it segregated true and false variants effectively (true hits 86.6% and sensitivity 95.93%).
Manual annotation of identified variant PSMs from one fraction of a large dataset (PXD004010) also demonstrated that VAS was highly effective in segregating true and false variants in an automated manner. We also used large public datasets with approximately 2.8 million spectra (PXD004010 and PXD001468) for a comprehensive evaluation of PgxSAVy. On these datasets, PgxSAVy identified and filtered out ∼50% of the variants as false positives, which suggests that proteogenomics variants should be rigorously tested before biological conclusions are drawn. We also integrated current knowledge on variants through an annotation framework in PgxSAVy that annotates variants based on their known roles in disease. PgxSAVy provides a rigorous framework for quality control and annotation of variant peptide results from one or more search algorithms and helps researchers prioritize these variants for further study.

Competing Interest Statement: The authors have declared no competing interest.
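The abstract mentions class-specific FDR estimation as one existing way to limit false variant hits. The sketch below illustrates that general idea under stated assumptions. It is not PgxSAVy's code and does not compute VAS. It assumes a target-decoy search and uses the simple estimator decoys / targets within each peptide class. The function name, the tuple layout, and the class labels "reference" and "variant" are hypothetical, and the scores are made-up toy values.

```python
from collections import defaultdict


def class_specific_fdr(psms, threshold):
    """Estimate FDR separately for each peptide class at a score threshold.

    psms: iterable of (score, is_decoy, peptide_class) tuples (assumed layout).
    Returns {peptide_class: decoys / targets} over PSMs scoring at or above
    the threshold, or None for a class with no passing target PSMs.
    """
    targets = defaultdict(int)
    decoys = defaultdict(int)
    for score, is_decoy, peptide_class in psms:
        if score < threshold:
            continue  # PSM does not pass the score cut-off
        if is_decoy:
            decoys[peptide_class] += 1
        else:
            targets[peptide_class] += 1

    classes = set(targets) | set(decoys)
    return {
        c: (decoys[c] / targets[c]) if targets[c] else None
        for c in classes
    }


# Toy PSMs (hypothetical values): (score, is_decoy, peptide_class)
toy_psms = [
    (0.90, False, "reference"),
    (0.85, False, "reference"),
    (0.80, False, "reference"),
    (0.70, False, "reference"),
    (0.75, True, "reference"),
    (0.90, False, "variant"),
    (0.80, False, "variant"),
    (0.78, True, "variant"),
    (0.30, True, "variant"),  # below the threshold, ignored
]

fdr_by_class = class_specific_fdr(toy_psms, threshold=0.5)
for peptide_class in sorted(fdr_by_class):
    print(peptide_class, fdr_by_class[peptide_class])
```

In this toy input the variant class has a higher estimated FDR (1 decoy per 2 targets) than the reference class (1 decoy per 4 targets) at the same score threshold, which is the kind of difference a single pooled estimate would hide.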