Genome-wide prediction of disease variants with a deep protein language model

biorxiv(2022)

引用 13|浏览3
暂无评分
摘要
Distinguishing between damaging and neutral missense variants is an ongoing challenge in human genetics, with profound implications for clinical diagnosis, genetic studies and protein engineering. Recently, deep-learning models have achieved state-of-the-art performance in classifying variants as pathogenic or benign. However, these models are currently unable to provide predictions over all missense variants, either because of dependency on close protein homologs or due to software limitations. Here we leveraged ESM1b, a 650M-parameter protein language model, to predict the functional impact of human coding variation at scale. To overcome existing technical limitations, we developed a modified ESM1b workflow and functionalized, for the first time, all proteins in the human genome, resulting in predictions for all ∼450M possible missense variant effects. ESM1b was able to distinguish between pathogenic and benign variants across ∼150K variants annotated in ClinVar and HGMD, outperforming existing state-of-the-art methods. ESM1b also exceeded the state of the art at predicting the experimental results of deep mutational scans. We further annotated ∼2M variants across ∼9K alternatively-spliced genes as damaging in certain protein isoforms while neutral in others, demonstrating the importance of considering all isoforms when functionalizing variant effects. The complete catalog of variant effect predictions is available at: [https://huggingface.co/spaces/ntranoslab/esm_variants][1]. ### Competing Interest Statement The authors have declared no competing interest. [1]: http://huggingface.co/spaces/ntranoslab/esm_variants
更多
查看译文
关键词
disease variants,prediction,protein,genome-wide
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要