Ensemble partial least squares regression for descriptor selection, outlier detection, applicability domain assessment, and ensemble modeling in QSAR/QSPR modeling

Journal of Chemometrics (2017)

Abstract
In QSAR/QSPR modeling, building an accurate partial least squares (PLS) model usually involves descriptor selection, outlier detection, applicability domain assessment, the handling of nonlinear relationships, and model stability. In the present study, we present an ensemble PLS (EnPLS) method that addresses these modeling tasks within a unified methodological framework built on ensemble learning and statistical distributions. The approach exploits the fact that the distribution of PLS model coefficients provides a mechanism for ranking and interpreting the effects of variables, whereas the distribution of prediction errors provides a mechanism for differentiating outliers from normal samples and for assessing the applicability domain of models. Statistics of these distributions, namely the mean or median and the standard deviation, provide a practical way to summarize the information contained in the original samples. Furthermore, ensemble modeling and prediction based on several cross-predictive PLS models can improve prediction performance and increase model stability to a certain extent. Aqueous solubility data are used to demonstrate the ability of the proposed EnPLS method to handle modeling tasks such as descriptor selection, outlier detection, applicability domain assessment, performance improvement, and model stability. Finally, a freely available R package implementing EnPLS has been developed for use by chemists and pharmacologists. The R package is freely available at .
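The idea described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' R package and does not reproduce its API. The function names `pls1_fit` and `enpls_sketch`, the Monte Carlo subsampling scheme, the 80% training ratio, and the importance score |mean| / SD of the coefficients are all assumptions made for illustration. The sketch fits many single-response PLS models on random subsets, then summarizes the coefficient distribution per descriptor and the held-out prediction error distribution per sample.

```python
import random
import statistics


def pls1_fit(X, y, n_components):
    """Fit single-response PLS (NIPALS) on centered data; return (coefs, intercept)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    coefs, rotations, loadings = [0.0] * p, [], []
    for _ in range(n_components):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Rotation r expresses the score in terms of the undeflated centered X.
        r = w[:]
        for r_prev, p_prev in zip(rotations, loadings):
            c = sum(a * b for a, b in zip(p_prev, w))
            r = [rj - c * rp for rj, rp in zip(r, r_prev)]
        rotations.append(r)
        loadings.append(load)
        coefs = [b + q * rj for b, rj in zip(coefs, r)]
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
    intercept = y_mean - sum(b * m for b, m in zip(coefs, x_mean))
    return coefs, intercept


def enpls_sketch(X, y, n_components=2, n_models=100, ratio=0.8, seed=0):
    """Return (importance per descriptor, error mean per sample, error SD per sample)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    coef_draws = [[] for _ in range(p)]
    errors = [[] for _ in range(n)]
    for _ in range(n_models):
        train = set(rng.sample(range(n), int(ratio * n)))
        idx = sorted(train)
        coefs, b0 = pls1_fit([X[i] for i in idx], [y[i] for i in idx], n_components)
        for j in range(p):
            coef_draws[j].append(coefs[j])
        for i in range(n):
            if i not in train:  # errors are recorded for held-out samples only
                pred = b0 + sum(c * x for c, x in zip(coefs, X[i]))
                errors[i].append(y[i] - pred)
    importance = []
    for draws in coef_draws:
        m, s = statistics.fmean(draws), statistics.stdev(draws)
        importance.append(abs(m) / s if s > 0 else float("inf"))
    nan = float("nan")
    err_mean = [statistics.fmean(e) if e else nan for e in errors]
    err_sd = [statistics.pstdev(e) if e else nan for e in errors]
    return importance, err_mean, err_sd
```

Under these assumptions, descriptors with a large `importance` value have coefficients that are both sizable and stable across the ensemble, while samples whose held-out error mean or SD is unusually large are candidates for outliers or for lying outside the applicability domain.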
Keywords
applicability domain assessment, ensemble learning, outlier detection, partial least squares (PLS), QSAR, QSPR, variable selection