ECAmyloid: An amyloid predictor based on ensemble learning and comprehensive sequence-derived features.

Computational Biology and Chemistry (2023)

Abstract
Amyloid fibrils formed by the mis-aggregation of amyloid proteins can lead to neuronal degeneration in Alzheimer's disease. Predicting amyloid proteins not only contributes to understanding their physicochemical properties and formation mechanisms, but also has significant implications for the treatment of amyloid diseases and the development of new uses for amyloid materials. In this study, an ensemble learning model based on sequence-derived features, ECAmyloid, is proposed to identify amyloids. The sequence-derived features, namely Pseudo Position-Specific Scoring Matrix (Pse-PSSM), Split Amino Acid Composition (SAAC), Solvent Accessibility (SA), and Secondary Structure Information (SSI), are employed to incorporate sequence composition, evolutionary, and structural information. The individual learners of the ensemble are selected by an incremental classifier selection strategy, and the final prediction is determined by voting over the predictions of the individual learners. Because the benchmark dataset is imbalanced, the Synthetic Minority Over-sampling Technique (SMOTE) is adopted to generate positive samples. To eliminate irrelevant and redundant features, correlation-based feature subset (CFS) selection combined with a heuristic search strategy is performed to obtain the optimal feature subset. Experimental results indicate that the ensemble classifier achieves an accuracy of 98.29%, a sensitivity of 0.992, and a specificity of 0.974 on the training dataset under 10-fold cross-validation, far higher than the results obtained by its individual learners. Compared with the original feature set, the accuracy, sensitivity, specificity, MCC, F1-score, and G-Mean of the ensemble method trained on the optimal feature subset are improved by 1.05%, 0.012, 0.01, 0.021, 0.011, and 0.011, respectively.
Moreover, comparison with existing methods on the same two independent test datasets demonstrates that the proposed method is an effective and promising predictor for large-scale determination of amyloid proteins. The data and code used to develop ECAmyloid have been shared on GitHub and can be freely downloaded at https://github.com/KOALA-L/ECAmyloid.git.
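
As a rough illustration of the voting step and the reported evaluation measures, the sketch below combines binary predictions from several learners by simple majority vote and then computes accuracy, sensitivity, specificity, MCC, F1-score, and G-Mean from the resulting confusion matrix. It is not taken from the ECAmyloid repository: the function names `majority_vote` and `binary_metrics`, the toy labels, and the use of unweighted hard voting are assumptions made for illustration, and G-Mean is assumed to be the usual square root of sensitivity times specificity.

```python
import math


def majority_vote(votes_per_learner):
    """Hard-vote binary labels (0/1); hypothetical helper, not the authors' code.

    votes_per_learner holds one list of labels per learner, all of equal length.
    A sample is labelled 1 only if strictly more than half of the learners say 1.
    """
    n_learners = len(votes_per_learner)
    return [1 if 2 * sum(sample_votes) > n_learners else 0
            for sample_votes in zip(*votes_per_learner)]


def binary_metrics(y_true, y_pred):
    """Compute the six measures named in the abstract from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard every ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    g_mean = math.sqrt(sensitivity * specificity)  # assumed definition

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "f1": f1,
        "g_mean": g_mean,
    }


# Toy example: three learners, four samples (made-up labels).
votes = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
y_true = [1, 1, 0, 0]
y_pred = majority_vote(votes)
metrics = binary_metrics(y_true, y_pred)
print(y_pred)
print(metrics)
```

Using an odd number of learners avoids ties in the vote; with an even number, this sketch resolves a tie to label 0.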
Keywords
Amyloid, Correlation-Based Feature Subset Selection, Ensemble Learning, Sequence-derived Features, Synthetic Minority Over-sampling Technique