Scoring alignments by embedding vector similarity

Sepehr Ashrafzadeh, G. Brian Golding, Lucian Ilie

bioRxiv (Cold Spring Harbor Laboratory) (2023)

Abstract
Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness by being context-dependent. It relies on recent advances in deep learning architectures that leverage enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the best such method, ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at http://e-score.csd.uwo.ca/. The source code is freely available for download from https://github.com/lucian-ilie/E-score.

Competing Interest Statement: The authors have declared no competing interest.
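As an illustration of the E-score definition in the abstract, here is a minimal Python sketch that computes the cosine similarity between per-residue embedding vectors and fills a residue-by-residue score matrix for two sequences. The function names `cosine_similarity` and `e_score_matrix` and the toy three-dimensional vectors are assumptions made for this example. They are not taken from the authors' code, and real contextual embeddings (for instance from ProtT5) would come from a pretrained model and have far more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def e_score_matrix(emb_a, emb_b):
    """Score every residue of sequence A against every residue of sequence B.

    emb_a and emb_b are lists of embedding vectors, one vector per residue.
    Entry [i][j] is the cosine similarity between residue i of A and
    residue j of B, playing the role a BLOSUM lookup would otherwise play.
    """
    return [[cosine_similarity(u, v) for v in emb_b] for u in emb_a]


# Toy embeddings (hypothetical values, for illustration only).
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
emb_b = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = e_score_matrix(emb_a, emb_b)
for row in scores:
    print(["%.3f" % s for s in row])
```

Because each residue's vector depends on its surrounding sequence, the same pair of amino acids can receive different scores at different positions, which is the context dependence that a fixed substitution matrix lacks. A matrix like `scores` could then be passed to a standard dynamic-programming aligner in place of substitution-matrix lookups. The gap penalties and alignment mode used by the authors are not stated in the abstract, so they are left out of this sketch.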
Keywords
scoring alignments, vector