Learning protein fitness models from evolutionary and assay-labeled data

Nature Biotechnology (2022)

Abstract
Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model shows the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
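The combination described in the abstract can be illustrated with a minimal sketch: one-hot, site-specific amino acid features are concatenated with a single density feature, and a ridge regression is fit on the result. Everything concrete below is an assumption made for illustration and is not taken from the paper: the function names (`one_hot_sites`, `build_features`, `fit_ridge`, `predict`), the single shared regularization strength `lam`, the unpenalized intercept, and the toy sequences, density scores, and fitness labels. In practice the density feature would come from a model trained on the evolutionary data, such as a variational autoencoder.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_sites(seq):
    """Site-specific features: one 20-dimensional indicator block per position."""
    x = np.zeros(len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        x[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return x


def build_features(seqs, density_scores):
    """Concatenate one-hot site features with a single density feature column."""
    onehot = np.stack([one_hot_sites(s) for s in seqs])
    density = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([onehot, density])


def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression; the intercept is left unpenalized by centering."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    # Solve (Xc^T Xc + lam * I) w = Xc^T yc; lam > 0 makes the system positive definite.
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    return X @ w + b


# Hypothetical toy data: equal-length variants, placeholder density scores
# (standing in for log-likelihoods from an evolutionary model), and made-up labels.
seqs = ["ACDE", "ACDF", "GCDE", "ACHE", "GCHF"]
density_scores = [-1.0, -2.5, -1.8, -3.0, -4.2]
y = np.array([1.0, 0.4, 0.7, 0.2, -0.5])

X = build_features(seqs, density_scores)
w, b = fit_ridge(X, y, lam=1.0)
preds = predict(X, w, b)
```

The last entry of `w` is the weight on the density feature, so its magnitude relative to the one-hot weights gives a rough sense of how much the fitted model leans on the evolutionary signal. A single `lam` for all features is the simplest choice here; scaling the density column or penalizing it separately would be a natural variation.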
Keywords
Machine learning, Protein design, Life Sciences (general), Biotechnology, Biomedicine, Agriculture, Biomedical Engineering/Biotechnology, Bioinformatics