A causality-inspired feature selection method for cancer imbalanced high-dimensional data

bioRxiv (2021)

Abstract
Identifying a robust subset of biomarkers that distinguishes cancer from normal samples in high-dimensional, imbalanced cancer omics data is important but challenging. Although many feature selection methods addressing high dimensionality and class imbalance have been proposed, they rarely account for the fact that the majority class dominates the final decision when the dataset is imbalanced, which leads to instability in downstream tasks. Owing to the invariance of causal relationships, causal inference is considered an effective way to improve the performance and stability of machine learning. This paper proposes a Causality-inspired Least Angle Nonlinear Distributed (CLAND) feature selection method consisting of two branches, a class-wise branch and a sample-wise branch, each representing one deconfounder strategy. We compared CLAND with other advanced feature selection methods on transcriptional data from six cancer types with different imbalance ratios. The genes selected by CLAND show superior accuracy, stability, and generalization in downstream classification tasks, indicating potential causality for identifying cancer samples. A review of the literature further shows that these genes play an essential role in cancer initiation and progression.

Author Summary

Selecting trustworthy biomarkers from high-dimensional data is an important step in helping researchers and clinicians understand which genes play key roles in cancer development and progression. Many machine learning-based feature selection algorithms have been developed in recent years for biomarker discovery. However, these methods usually produce unstable results on class-imbalanced biological data, which makes them seem unreliable to researchers.
Here we introduce causal theory, with its property of causal invariance, to guide the design of feature selection algorithms, analyze how imbalanced distributions affect feature selection methods, and propose a novel causality-based feature selection method. Its bilateral (two-branch) structure adjusts the data distribution both class-wise and sample-wise to eliminate the effect of imbalance on the results. Additionally, CLAND simultaneously addresses the nonlinearity and high dimensionality of cancer data, which broadens its scope of application. We conducted extensive experiments on six real imbalanced cancer datasets and obtained efficient and stable results, and the selected biomarkers are biologically meaningful.

Competing Interest Statement

The authors have declared no competing interest.
Keywords
feature selection method, causality-inspired, high-dimensional