Spectrum Pursuit With Residual Descent for Column Subset Selection Problem: Theoretical Guarantees and Applications in Deep Learning

Saeed Vahidian,Mohsen Joneidi, Ashkan Esmaeili,Siavash Khodadadeh, Sharareh Zehtabian,Bill Lin

IEEE ACCESS(2022)

引用 2|浏览14
暂无评分
摘要
We propose a novel technique for dataset summarization by selecting representatives from a large, unsupervised dataset. The approach is based on the concept of self-rank, defined as the minimum number of samples needed to express all dataset samples with an accuracy proportional to the rank-K approximation. As the exact computation of self-rank requires a computationally expensive combinatorial search, we propose an efficient algorithm that jointly estimates self-rank and selects the most informative samples in a linear order of complexity w.r.t the data size. We derive a new upper bound for the approximation ratio (AR), the ratio of obtained projection error using selected samples to the best rank-K approximation error. The best previously known AR for self-representative low-rank approximation was presented in ICML 2017, which was further improved by the bound root 1 + K reported in NeurIPS 2019. Both of these bounds are obtained by brute force search, which is not practical, and these bounds depend solely on K, the number of selected samples. In contrast, we describe an adaptive AR that takes into consideration the spectral properties and spikiness measure of the original dataset, A is an element of R-NxM. In particular, our performance bound is proportional to the condition number kappa(A). Our derived AR is expressed as 1 + (kappa(A)(2) - 1)/(N - K), which approaches 1 and is optimal in two extreme spectral distribution instances. In the worst case, AR is shown not to exceed 1.25 for the proposed method. Our proposed algorithm enjoys linear complexity w.r.t. size of original dataset, which results in filling a historical gap between practical and theoretical methods in finding representatives. In addition to evaluating the proposed algorithm on a synthetic dataset, we show that it can be utilized in real-world applications such as graph node sampling for optimizing the shortest path criterion, learning a classifier with representative data, and open-set identification.
更多
查看译文
关键词
Column subset selection, data summarization, graph node selection, low-rank, meta learning, open-set identification, self-rank, spectrum pursuit, SVD
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要