Stability Oracle: A Structure-Based Graph-Transformer for Identifying Stabilizing Mutations

Daniel J. Diaz,Chengyue Gong,Jeffrey Ouyang-Zhang, James M. Loy, Jeremy M. Wells, David Yang,Andrew D. Ellington, Alex Dimakis,Adam R. Klivans

bioRxiv (Cold Spring Harbor Laboratory)(2023)

Cited 1|Views6
No score
Abstract Stabilizing proteins is a fundamental challenge in protein engineering and is almost always a prerequisite for the development of industrial and pharmaceutical biotechnologies. Here we present Stability Oracle: a structure-based graph-transformer framework that achieves state-of-the-art performance on predicting the effect of a point mutation on a protein’s thermodynamic stability (ΔΔG). A strength of our model is its ability to identify stabilizing mutations, which often make up a small fraction of a protein’s mutational landscape. Our framework introduces several data and machine learning innovations to overcome well-known challenges in data scarcity and bias, generalization, and computation time. Stability Oracle is first pretrained on over 2M masked microenvironments and then fine-tuned using a novel data augmentation technique, Thermodynamic Permutations (TP), applied to a ∼120K curated subset of the mega-scale cDNA display proteolysis dataset. This technique increases the original 120K mutations to over 2M thermodynamically valid ΔΔG measurements to generate the first structure training set that samples and balances all 380 mutation types. By using the masked microenvironment paradigm, Stability Oracle does not require a second mutant structure and instead uses amino acid structural embeddings to represent a mutation. This architectural design accelerates training and inference times: we can both train on 2M instances with just 119 structures and generate deep mutational scan (DMS) predictions from only the wildtype structure. We benchmark Stability Oracle with both experimental and AlphaFold structures of all proteins on T2837, a test set that aggregates the common test sets (SSym, S669, p53, and Myoglobin) with all additional experimental data from proteins with over a 30% sequence similarity overlap. We used TP augmented T2837 to evaluate performance for engineering protein stability: Stability Oracle correctly identifies 48% of stabilizing mutations (ΔΔG < −0.5 kcal/mol) and 74% of its stabilizing predictions are indeed stabilizing (18% and 8% of predictions were neutral and destabilizing, respectively). For a fair comparison between sequence and structure-based fine-tuned deep learning models, we build on the Prostata framework and fine-tune the sequence embeddings of ESM2 on our training set (Prostata-IFML). A head-to-head comparison demonstrates that Stability Oracle outperforms Prostata-IFML on regression and classification even though the model is 548 times smaller and is pretrained with 4000 times fewer proteins, highlighting the advantages of learning from structures.
Translated text
Key words
stabilizing mutations,stability,structure-based,graph-transformer
AI Read Science
Must-Reading Tree
Generate MRT to find the research sequence of this paper
Chat Paper
Summary is being generated by the instructions you defined