A New Scalable Approach For Missing Value Imputation In High-Throughput Microarray Data On Apache Spark

INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS (2020)

Abstract
High-dimensional data are acquired using high-throughput technology (HTT), and data extracted with HTT contain a large number of missing values. Gene expression data are vital in healthcare research, and reconstructing their missing values is a challenging task. This work proposes a scalable technique, PC-ImNN, which combines Pearson correlation with Monte Carlo sampling and a modified nearest neighbour method to predict missing values. Monte Carlo is a technique that uses repeated random sampling to make numerical estimates of unknown parameters. Pearson correlation is combined with Monte Carlo to maintain the distribution of the estimated data points. The nearest neighbour technique is then applied to find the nearest estimated data points. The proposed model is compared with five existing imputation techniques. The results show that the proposed technique performs better in terms of mean square error and imputation accuracy. Apache Spark is used to speed up the computation.
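The abstract names three ingredients (Pearson correlation, Monte Carlo sampling, and nearest neighbours) but does not give the algorithm. The following is a minimal illustrative sketch of how such ingredients could fit together. It is not the authors' PC-ImNN method. The function names `pearson_observed` and `impute_sketch`, the parameters `k`, `n_draws`, and `seed`, and the use of bootstrap resampling as the Monte Carlo step are all assumptions made for illustration.

```python
import numpy as np

def pearson_observed(a, b):
    """Pearson correlation over positions observed (non-NaN) in both vectors."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if mask.sum() < 3:
        return 0.0  # too few shared observations to trust a correlation
    x, y = a[mask], b[mask]
    sx, sy = x.std(), y.std()
    if sx == 0 or sy == 0:
        return 0.0  # constant vector: correlation undefined, treat as 0
    return float(np.mean((x - x.mean()) * (y - y.mean())) / (sx * sy))

def impute_sketch(data, k=3, n_draws=200, seed=0):
    """Hypothetical illustration, NOT the published PC-ImNN algorithm.

    For each missing entry (i, j):
      1. rank the other rows by Pearson correlation with row i,
      2. keep the k most correlated rows that have column j observed,
      3. estimate the value by Monte Carlo (bootstrap) resampling of
         those neighbours' values in column j.
    """
    rng = np.random.default_rng(seed)
    filled = data.copy()
    n_rows = data.shape[0]
    for i, j in zip(*np.where(np.isnan(data))):
        candidates = [r for r in range(n_rows)
                      if r != i and not np.isnan(data[r, j])]
        if not candidates:
            continue  # nothing to borrow from; leave the entry missing
        corrs = np.array([pearson_observed(data[i], data[r])
                          for r in candidates])
        top = np.argsort(-corrs)[:k]
        values = data[np.array(candidates)[top], j]
        # Monte Carlo step: repeated random sampling with replacement,
        # then average the per-draw means.
        draws = rng.choice(values, size=(n_draws, len(values)), replace=True)
        filled[i, j] = draws.mean(axis=1).mean()
    return filled

# Toy expression matrix (rows = genes, columns = samples), one missing value.
toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 2.0, 3.0, np.nan],
    [5.0, 1.0, 4.0, 2.0],
])
completed = impute_sketch(toy, k=2, n_draws=50, seed=1)
```

In this sketch each missing entry is handled independently of the others, which is the kind of work a framework such as Apache Spark can distribute across workers. That distribution step is not shown here.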
Keywords
missing value, Pearson's correlation, nearest neighbour, mean square error, Monte Carlo method, support vector machine, microarray data