Feature Ranking and Selection for Big Data Sets.

ADBIS (Short Papers and Workshops)(2016)

Abstract
The availability of big data sets has led to the successful application of machine learning and data mining to problems that were previously unsolved. Applying these techniques, though, is rarely straightforward. High dimensionality is often one of the main obstacles to overcome before an adequate model can be learned or useful conclusions drawn from large amounts of data. Rank-revealing matrix factorizations can help address this problem by permuting the columns of the input data so that linearly dependent, and thus redundant, columns are moved to the right. These factorizations, however, are designed to operate in a centralized fashion and require the input data to be loaded into main memory, which makes them inapplicable to large data sets. In this paper we prove that data sets comprising a huge number of rows can be easily transformed into a compact square matrix that preserves the permutation yielded by rank-revealing QR factorizations. This leads to a simple algorithm for running these factorizations on big data sets regardless of their number of rows. The nature of the transformation also makes it possible to deal with high-dimensional data with a controlled loss of precision. We offer experimental results showing that our method can improve the k-means algorithm, both in clustering results and in running time.
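The idea of replacing a tall data set with a small square matrix can be illustrated with a minimal sketch. The abstract does not spell out the transformation, so the code below is an assumption rather than the authors' algorithm: it folds row chunks into an upper-triangular factor R satisfying RᵀR = AᵀA through repeated QR steps, then applies column-pivoted QR (SciPy's `qr` with `pivoting=True`) to both the full data and the compact factor. The helper names `compact_factor` and `pivot_order`, the synthetic data, and the chunk count are all hypothetical.

```python
import numpy as np
from scipy.linalg import qr


def compact_factor(row_chunks):
    """Fold row chunks into a small upper-triangular R with R.T @ R == A.T @ A.

    Assumption: this chunked QR is one possible row-compressing transform,
    not necessarily the one proved in the paper.
    """
    r = None
    for chunk in row_chunks:
        # Stack the running factor on top of the next chunk and re-factorize;
        # only the small R is kept in memory between chunks.
        stacked = chunk if r is None else np.vstack([r, chunk])
        r = np.linalg.qr(stacked, mode="r")
    return r


def pivot_order(matrix):
    """Return the column permutation chosen by QR with column pivoting."""
    _, _, perm = qr(matrix, mode="economic", pivoting=True)
    return perm


# Synthetic data (hypothetical): five independent columns with well-separated
# scales, plus a sixth column that is a linear combination of columns 1 and 3.
rng = np.random.default_rng(0)
n_rows = 10_000
scales = np.array([3.0, 8.0, 1.0, 5.0, 2.0])
base = rng.standard_normal((n_rows, 5)) * scales
redundant = 0.05 * (base[:, 1] + base[:, 3])
data = np.column_stack([base, redundant])

# Process the rows in 20 chunks, as if they did not fit in memory at once.
chunks = np.array_split(data, 20)
r_small = compact_factor(chunks)

perm_full = pivot_order(data)        # pivoting on the full 10000 x 6 matrix
perm_compact = pivot_order(r_small)  # pivoting on the compact 6 x 6 matrix

print("compact shape:", r_small.shape)
print("full permutation:   ", perm_full)
print("compact permutation:", perm_compact)
```

On this well-conditioned example the two permutations should coincide, with the redundant column ranked last. In floating-point arithmetic, columns with nearly equal residual norms could in principle be ordered differently, so agreement on arbitrary data is not guaranteed by this sketch.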
Keywords
Feature selection, Unsupervised learning, Big data