Probabilistic Leverage Scores for Parallelized Unsupervised Feature Selection.

ADVANCES IN COMPUTATIONAL INTELLIGENCE, IWANN 2017, PT II(2017)

Abstract
Dimensionality reduction is often crucial for the application of machine learning and data mining. Feature selection methods can be employed for this purpose, with the advantage of preserving interpretability. There exist unsupervised feature selection methods based on matrix factorization algorithms, which can help choose the most informative features in terms of approximation error. Randomized methods have recently been proposed that provide better theoretical guarantees and better approximation errors than their deterministic counterparts, but their computational costs can be significant when dealing with large, high-dimensional data sets. Some existing randomized and deterministic approaches require computing the singular value decomposition in O(mn min(m,n)) time (for m samples and n features) to obtain the leverage scores, which compromises their applicability even to domains of moderately high dimensionality. In this paper we propose the use of Probabilistic PCA to compute the leverage scores in O(mnk) time, enabling the application of some of these randomized methods to large, high-dimensional data sets. We show that with this approach we can rapidly provide an approximation of the leverage scores that works well in this context. In addition, we offer a parallelized version built on the Resilient Distributed Dataset (RDD) abstraction of Apache Spark, making it horizontally scalable to very large numbers of data instances. We validate the performance of our approach on several data sets comprising real-world and synthetic data.
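The abstract does not spell out the computation, but the idea of deriving approximate column leverage scores from a rank-k Probabilistic PCA fit can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the NumPy implementation, the EM update details, and the name ppca_leverage_scores are assumptions made here for the example. Each EM iteration costs O(mnk), consistent with the complexity claimed in the abstract.

```python
import numpy as np

def ppca_leverage_scores(X, k, n_iter=50, seed=0):
    """Approximate leverage scores of the columns (features) of X via a
    rank-k Probabilistic PCA fit obtained with EM.  Illustrative sketch."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    Xc = X - X.mean(axis=0)              # centre the data
    W = rng.standard_normal((n, k))      # loading matrix, n x k
    sigma2 = 1.0                         # isotropic noise variance

    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables z_i
        M = W.T @ W + sigma2 * np.eye(k)            # k x k
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                          # m x k, E[z_i]
        Ezz = m * sigma2 * Minv + Ez.T @ Ez         # sum_i E[z_i z_i^T]

        # M-step: update the loading matrix and the noise variance
        S = Xc.T @ Ez                               # n x k, sum_i x_i E[z_i]^T
        W = S @ np.linalg.inv(Ezz)
        sigma2 = max((np.sum(Xc ** 2) - np.sum(S * W)) / (m * n), 1e-12)

    # Leverage score of feature j: squared norm of row j of an orthonormal
    # basis of the learned k-dimensional principal subspace
    Q, _ = np.linalg.qr(W)                          # n x k, orthonormal columns
    scores = np.sum(Q ** 2, axis=1)
    return scores / scores.sum()                    # normalise for sampling

# Toy usage: rank the features of a random 1000 x 200 matrix
if __name__ == "__main__":
    X = np.random.default_rng(1).standard_normal((1000, 200))
    p = ppca_leverage_scores(X, k=10)
    top_features = np.argsort(p)[::-1][:20]         # 20 highest-scoring features
```

In a distributed version along the lines the abstract describes, the per-sample sums needed in each EM iteration (Xc.T @ Ez and Ez.T @ Ez) could be accumulated over an RDD of rows with a map followed by a reduce, which is what would make the procedure horizontally scalable in the number of data instances; the exact Spark implementation used by the authors is not detailed in the abstract.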
Keywords
Machine learning, Feature selection, Distributed computing