Prediction of virus-host association using protein language models and multiple instance learning

bioRxiv (2023)

Abstract
Predicting virus-host associations is essential for understanding how viruses interact with host species and for discovering new therapeutics for viral diseases in humans and animals. Currently, the hosts of most viruses are unknown. Here, we introduce EvoMIL, a deep learning method that predicts virus-host association at the species level from viral sequence only. The method combines a pre-trained large protein language model with attention-based multiple instance learning (MIL) to allow protein-oriented predictions. Our results show that protein embeddings capture stronger predictive signals than traditional handcrafted features, including amino acid and DNA k-mers and physicochemical properties. EvoMIL binary classifiers achieve AUC values of over 0.95 for all prokaryotic hosts and nearly 0.8 for almost all eukaryotic hosts. In multi-host prediction tasks, EvoMIL achieved median performance improvements of 8.6% for prokaryotic hosts and 1.8% for eukaryotic hosts. Furthermore, EvoMIL estimates the importance of individual proteins to the prediction and maps them to an embedding landscape of all viral proteins, in which proteins with similar functions cluster together.

Competing Interest Statement: The authors have declared no competing interest.
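
The attention-based MIL step named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a virus is treated as a "bag" of per-protein embedding vectors, that a tanh attention layer scores each protein, and that a logistic layer turns the pooled vector into a host probability. All names (`attention_mil_pool`, `predict_host_probability`, `V`, `w`, `c`, `b`) and the random values standing in for language-model embeddings and trained weights are hypothetical.

```python
import numpy as np


def attention_mil_pool(H, V, w):
    """Pool a bag of protein embeddings into one virus-level vector.

    H: (n_proteins, d) embeddings, one row per viral protein (assumed input).
    V: (d, k) and w: (k,) are attention parameters (hypothetical, untrained here).
    Returns the pooled vector (d,) and the attention weights (n_proteins,).
    """
    scores = np.tanh(H @ V) @ w      # one unnormalised score per protein
    scores = scores - scores.max()   # shift for a numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()  # softmax: weights sum to 1
    pooled = weights @ H             # attention-weighted average of embeddings
    return pooled, weights


def predict_host_probability(H, V, w, c, b):
    """Binary host score: logistic layer on top of the pooled bag vector."""
    pooled, weights = attention_mil_pool(H, V, w)
    logit = pooled @ c + b
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, weights


# Toy bag: 5 proteins with 8-dimensional embeddings (random stand-ins).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)
c = rng.normal(size=8)

prob, weights = predict_host_probability(H, V, w, c, 0.0)
```

In a sketch like this, the returned `weights` play the role of the per-protein importance estimates mentioned in the abstract: a larger weight means that protein contributed more to the pooled representation.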