Protein-Protein Interaction Sites Prediction Based on an Under-Sampling Strategy and Random Forest Algorithm

IEEE/ACM Transactions on Computational Biology and Bioinformatics (2022)

Abstract
Computational methods for predicting protein-protein interaction sites can avoid the high cost and long turnaround of traditional experimental approaches. However, the severe class imbalance between interface and non-interface residues in protein sequences limits the prediction performance of these methods. This work therefore proposes a new strategy, NearMiss-based under-sampling for imbalanced datasets combined with Random Forest classification (NM-RF), to predict protein interaction sites. Residues in protein sequences are represented by PSSM-derived features, the hydropathy index (HI), and relative solvent accessibility (RSA). To address the class imbalance, an under-sampling method based on the NearMiss algorithm removes some non-interface residues, and a random forest then performs binary classification on the balanced feature dataset. Experiments show that the NM-RF model reaches accuracies of 87.6% and 84.3% on Dtestset72 and PDBtestset164, respectively, demonstrating the effectiveness of the proposed method in distinguishing interface from non-interface residues.
Keywords
Protein interaction sites, class imbalance, under-sampling, random forest
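
The under-sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which NearMiss variant, neighbor count, or class ratio the authors used, so this example assumes the NearMiss-1 rule (keep the majority-class samples whose mean distance to their k nearest minority-class samples is smallest) and a 1:1 balance. The function name `nearmiss1_undersample`, the parameter `k=3`, and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random


def nearmiss1_undersample(X, y, k=3, majority_label=0, minority_label=1):
    """Keep all minority rows plus an equal number of majority rows.

    Assumed rule (NearMiss-1): keep the majority rows whose mean Euclidean
    distance to their k nearest minority rows is smallest.
    """
    minority = [x for x, label in zip(X, y) if label == minority_label]
    majority = [x for x, label in zip(X, y) if label == majority_label]
    if not minority:
        raise ValueError("no minority-class samples to balance against")
    k = min(k, len(minority))

    def mean_dist_to_nearest_minority(row):
        dists = sorted(math.dist(row, m) for m in minority)
        return sum(dists[:k]) / k

    # Rank majority rows by closeness to the minority class; keep the closest.
    ranked = sorted(majority, key=mean_dist_to_nearest_minority)
    kept = ranked[:len(minority)]

    X_bal = minority + kept
    y_bal = [minority_label] * len(minority) + [majority_label] * len(kept)
    return X_bal, y_bal


# Toy data (made up): 3-dimensional stand-ins for per-residue feature vectors.
# Label 1 = interface residue (minority), label 0 = non-interface (majority).
random.seed(0)
X_interface = [[random.gauss(1.0, 0.5) for _ in range(3)] for _ in range(10)]
X_non_interface = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(90)]
X = X_interface + X_non_interface
y = [1] * 10 + [0] * 90

X_bal, y_bal = nearmiss1_undersample(X, y, k=3)
print(len(X_bal), sum(y_bal))  # 20 10
```

In a pipeline like the one the abstract describes, the balanced `X_bal` and `y_bal` would then be passed to a random forest classifier (for example, scikit-learn's `RandomForestClassifier`) for binary classification; that step is omitted here to keep the sketch dependency-free.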