Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep) and Its Implications for Protein Engineering

biorxiv(2021)

引用 1|浏览5
暂无评分
摘要
In recent years, deep learning has been increasingly used to decipher the relationships among protein sequence, structure, and function. Thus far these applications of deep learning have been mostly based on primary sequence information, while the vast amount of tertiary structure information remains untapped. In this study, we devised a self-supervised representation learning framework (PtsRep) to extract the fundamental features of unlabeled protein tertiary structures deposited in the PDB, a total of 35,568 structures. The learned embeddings were challenged with two commonly recognized protein engineering tasks: the prediction of protein stability and prediction of the fluorescence brightness of green fluorescent protein (GFP) variants, with training datasets of 16,431 and 26,198 proteins or variants, respectively. On both tasks, PtsRep outperformed the two benchmark methods UniRep and TAPE-BERT, which were pre-trained on two much larger sets of data of 24 and 32 million protein sequences, respectively. Protein clustering analyses demonstrated that PtsRep can capture the structural signatures of proteins. Further testing of the GFP dataset revealed two important implications for protein engineering: (1) a reduced and experimentally manageable training dataset (20%, or 5,239 variants) yielded a satisfactory prediction performance for PtsRep, achieving a recall rate of 70% for the top 26 brightest variants with 795 variants in the testing dataset retrieved; (2) counter-intuitively, when only the bright variants were used for training, the performances of PtsRep and the benchmarks not only did not worsen but they actually slightly improved. This study provides a new avenue for learning and exploring general protein structural representations for protein engineering. ### Competing Interest Statement The authors have declared no competing interest.
更多
查看译文
关键词
protein tertiary structures,ptsrep,self-supervised
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要