Identifying key amino acid types that distinguish paralogous proteins using Shapley value based feature subset selection

bioRxiv (2024)

Abstract
We view a protein as a composite of the 20 standard amino acids (ignoring their order in the protein sequence) and try to identify a set of important amino acid types whose composition is enough to distinguish two paralogous proteins. For this, we use a linear classifier with amino acid composition as features, together with a Shapley value based feature subset selection algorithm. We demonstrate our method on 15 datasets of paralogous protein pairs. We find that amino acid composition alone is adequate to distinguish many paralogous proteins from each other. For a pair of paralogous proteins, we identify, for each protein, a subset of amino acids, referred to as the AFS (amino acid feature subset), that is key to distinguishing the two. We validate the discriminative ability of the AFS amino acids by analyzing multiple sequence alignments of the corresponding protein families and/or by providing supporting evidence from the literature. We also classify sub-families of a protein superfamily pairwise and highlight the amino acids that the AFSs of two pairs sharing a sub-family have in common.

Competing Interest Statement: The authors have declared no competing interest.
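The pipeline described in the abstract (order-free amino acid composition, a linear classifier, and Shapley values over features) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the classifier or the selection algorithm, so the sketch assumes a nearest-centroid rule (which gives a linear decision boundary for two classes) scored on the training data, and a plain permutation-sampling estimate of Shapley values. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq):
    """Fraction of each standard amino acid in seq, ignoring residue order."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def centroid_accuracy(X, y, subset):
    """Training accuracy of a nearest-centroid rule using only the features in subset.

    Assumption: this stands in for the paper's linear classifier.
    """
    if not subset:
        return 0.5  # assumed baseline: chance level for two balanced classes
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in subset]
    correct = 0
    for x, t in zip(X, y):
        dist = {
            label: sum((x[j] - c) ** 2 for j, c in zip(subset, cen))
            for label, cen in centroids.items()
        }
        pred = min(dist, key=dist.get)  # ties resolve to label 0
        correct += int(pred == t)
    return correct / len(X)


def shapley_estimates(n_features, value, n_permutations=200, seed=0):
    """Permutation-sampling estimate of each feature's Shapley value.

    value(subset) maps a list of feature indices to a score (here, accuracy).
    """
    rng = random.Random(seed)
    phi = [0.0] * n_features
    for _ in range(n_permutations):
        order = list(range(n_features))
        rng.shuffle(order)
        chosen = []
        prev = value(chosen)
        for j in order:
            chosen.append(j)
            cur = value(chosen)
            phi[j] += cur - prev  # marginal contribution of feature j
            prev = cur
    return [p / n_permutations for p in phi]


# Hypothetical toy data: one "paralog" is lysine-rich, the other glutamate-rich.
seqs_a = ["KKKAKGKK", "KKAKKKLG", "AKKKGKKK"]
seqs_b = ["EEEAEGEE", "EEAEEELG", "AEEEGEEE"]
X = [composition(s) for s in seqs_a + seqs_b]
y = [0] * len(seqs_a) + [1] * len(seqs_b)

phi = shapley_estimates(
    len(AMINO_ACIDS), lambda S: centroid_accuracy(X, y, S), n_permutations=50
)
ranked = sorted(zip(AMINO_ACIDS, phi), key=lambda p: -p[1])
selected = [aa for aa, score in ranked if score > 0]  # candidate feature subset
```

In this toy setting only K and E differ between the two groups, so they share all of the credit and every other amino acid receives a Shapley estimate of zero. On real data one would score subsets with held-out accuracy rather than training accuracy, and the threshold for keeping a feature would need to be chosen with care.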