AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales

bioRxiv (2024)

Abstract
Summary: Amino acid scales are crucial for sequence-based protein prediction tasks, yet no gold standard scale set or simple scale selection methods exist. We developed AAclust, a wrapper for clustering models that require a pre-defined number of clusters k, such as k-means. AAclust obtains redundancy-reduced scale sets by clustering and selecting one representative scale per cluster, where k can either be optimized by AAclust or defined by the user. The utility of AAclust scale selections was assessed by applying machine learning models to 24 protein benchmark datasets. We found that top-performing scale sets were different for each benchmark dataset and significantly outperformed scale sets used in previous studies. Notably, model performance showed a strong positive correlation with the scale set size. AAclust enables a systematic optimization of scale-based feature engineering in machine learning applications.

Availability and implementation: The AAclust algorithm is part of AAanalysis, a Python-based framework for interpretable sequence-based protein prediction, which will be made freely accessible in a forthcoming publication.

Supplementary information: Further details on methods and results are provided in Supplementary Material.

Competing Interest Statement: The authors have declared no competing interest.
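
The cluster-then-select idea in the Summary can be illustrated with a minimal Python sketch. This is not the AAclust implementation or the AAanalysis API. It assumes a plain k-means with a user-defined k, deterministic farthest-first seeding, and "representative" taken to mean the scale closest to its cluster center; the abstract does not specify any of these choices. The scale names and the four-value vectors are hypothetical toy data (real amino acid scales have 20 values, one per residue), and `select_representatives` is a name made up for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def farthest_first_init(points, k):
    """Deterministic seeding (assumption): start at the first point,
    then repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(nxt)
    return centers


def kmeans(points, k, n_iter=100):
    """Plain k-means; stops when assignments no longer change."""
    centers = farthest_first_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = mean_vector(members)
    return labels, centers


def select_representatives(scales, k):
    """Cluster the scales and keep, per cluster, the scale closest to the
    cluster center (an assumed selection rule, for illustration only)."""
    names = list(scales)
    points = [scales[name] for name in names]
    labels, centers = kmeans(points, k)
    reps = []
    for j in range(k):
        members = [name for name, lab in zip(names, labels) if lab == j]
        if members:
            reps.append(min(members, key=lambda m: euclidean(scales[m], centers[j])))
    return reps


# Hypothetical toy scales: two groups of near-redundant scales plus one outlier.
toy_scales = {
    "hydrophobicity_a": [0.9, 0.8, 0.1, 0.2],
    "hydrophobicity_b": [0.8, 0.7, 0.2, 0.3],
    "hydrophobicity_c": [0.7, 0.6, 0.3, 0.4],
    "volume_a": [0.1, 0.2, 0.9, 0.8],
    "volume_b": [0.2, 0.3, 0.8, 0.7],
    "volume_c": [0.0, 0.1, 1.0, 0.9],
    "charge_a": [0.5, 0.0, 0.5, 1.0],
}

reps = select_representatives(toy_scales, k=3)
print(reps)
```

With k = 3, the seven toy scales are reduced to three, one per cluster. Choosing a larger k keeps more scales and therefore more redundancy; the k optimization that AAclust can perform is not described in the abstract and is not reproduced here.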