Homology detection using a protein secondary structure-based large language model

Roman Kogay, Weicheng Ma, Jad Bousselham, Zechen Yang, Daniel Rockmore, Olga Zhaxybayeva, Soroush Vosoughi

bioRxiv (2023)

Abstract
Detection of homology among proteins is fundamental to understanding protein function. Unfortunately, traditional homology searches using amino acid sequence similarity are limited when numerous amino acid substitutions have accumulated either due to billions of years of evolution or through processes of accelerated change. Recent applications of deep-learning approaches demonstrate that "protein language" models of amino acid sequences can improve the accuracy of the traditional homology searches. Ultimately, the ability to work seamlessly with tertiary structures of proteins will solve the homology detection challenge and provide accompanying insights directly related to function, but to date the use of 3D structures suffers both from data availability and computational bottlenecks. Herein, we present the Protein Secondary Structure Language (ProSSL) model, an efficient encoding of protein secondary structure information in a Transformer-based deep-learning architecture. We conjecture that the secondary protein structure, which is better conserved than primary sequences and much more easily predictable and available than tertiary protein structure, could aid in the task of homology detection. ProSSL has the computational advantages of primary sequence-based homology detection, while also providing important structural information for similarity scoring. Using two case studies of large, diverse viral protein families, we show that the ProSSL model successfully captures patterns of secondary structure arrangements and is effective in detecting homologs either as a pre-trained or fine-tuned model. In both tasks, we accurately detect members of these protein families, including those missed in traditional amino acid similarity searches. We also illustrate how functional insights from the individual ProSSL models could be obtained from the use of the Shapley Additive exPlanations (SHAP) values. 
Competing Interest Statement: The authors have declared no competing interest.
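The abstract's central conjecture, that secondary structure is better conserved than primary sequence, can be illustrated with a toy comparison. The sketch below is not the ProSSL model or its encoding: the sequences are invented for illustration, the three-state alphabet (H = helix, E = strand, C = coil) is an assumed convention, and the k-mer tokenizer is only one hypothetical way to turn a secondary-structure string into "words" that a language model could consume.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions in two pre-aligned, equal-length strings."""
    if len(a) != len(b):
        raise ValueError("toy example expects pre-aligned, equal-length strings")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kmer_tokens(s: str, k: int = 3) -> list[str]:
    """Split a string into overlapping k-mers (a hypothetical tokenization)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


# Invented toy data: two diverged amino acid sequences (assumption, not from the paper).
aa_1 = "MKTAYIAKQR"
aa_2 = "MRSGWLVEHN"

# Invented three-state secondary-structure strings for the same two toy proteins.
ss_1 = "CHHHHHHEEC"
ss_2 = "CHHHHHEEEC"

aa_identity = percent_identity(aa_1, aa_2)  # only the first residue matches
ss_identity = percent_identity(ss_1, ss_2)  # only one position differs

print(f"amino acid identity:         {aa_identity:.1f}")
print(f"secondary structure identity: {ss_identity:.1f}")
print("secondary structure tokens:", kmer_tokens(ss_1))
```

In this toy case the amino acid strings share almost no residues while the secondary-structure strings remain nearly identical, which is the kind of signal a structure-based similarity score could exploit when sequence similarity has eroded.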