Least-Mean-Squares Coresets for Infinite Streams

IEEE Transactions on Knowledge and Data Engineering (2023)

Abstract
Consider a stream of d-dimensional rows (points in R^d) arriving sequentially. An ε-coreset is a positively weighted subset of these rows that approximates their sum of squared distances to any linear subspace of R^d, up to a factor of 1 ± ε. Unlike other data summarizations, such a coreset: (1) can speed up the minimization of any objective function that depends on this sum, such as regularized or constrained regression; (2) preserves input sparsity; (3) is easily interpretable; (4) avoids numerical errors; and (5) applies to problems with constraints on the input, such as subspaces that are spanned by few input points. Our main result is the first algorithm that returns such an ε-coreset using finite, constant memory while streaming, i.e., memory independent of n, the number of rows seen so far. The coreset consists of O(d log^2 d / ε^2) weighted rows, which is nearly optimal according to existing lower bounds of Ω(d / ε^2). We support our findings with experiments on the Wikipedia dataset, benchmarked against state-of-the-art algorithms.
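To make the ε-coreset definition above concrete, the following is a minimal sketch in plain Python that spot-checks the 1 ± ε guarantee on a handful of random subspaces. It is not the paper's streaming algorithm: the function names (`orthonormalize`, `cost`, `is_eps_coreset_on`), the toy data, and the coreset built by merging duplicate rows are all assumptions made for illustration, and testing finitely many subspaces can only refute the guarantee, not prove it.

```python
import math
import random


def orthonormalize(vectors):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > 1e-12:  # skip (numerically) dependent vectors
            basis.append([x / norm for x in w])
    return basis


def sq_dist_to_subspace(p, basis):
    """Squared distance from p to the span of an orthonormal basis."""
    proj_sq = sum(sum(x * y for x, y in zip(p, b)) ** 2 for b in basis)
    return sum(x * x for x in p) - proj_sq


def cost(rows, weights, basis):
    """Weighted sum of squared distances from rows to the subspace."""
    return sum(w * sq_dist_to_subspace(p, basis)
               for p, w in zip(rows, weights))


def is_eps_coreset_on(rows, core_rows, core_weights, subspaces, eps):
    """Spot-check the 1 +/- eps guarantee on the given test subspaces."""
    for vectors in subspaces:
        basis = orthonormalize(vectors)
        full = cost(rows, [1.0] * len(rows), basis)
        approx = cost(core_rows, core_weights, basis)
        if abs(approx - full) > eps * full + 1e-9:
            return False
    return True


# Toy input in R^3 with repeated rows (illustrative data only).
rows = [[1, 0, 2], [1, 0, 2], [0, 3, 1], [1, 0, 2], [0, 3, 1], [4, 1, 0]]

# Trivial exact coreset: distinct rows weighted by their multiplicity.
core_rows = [[1, 0, 2], [0, 3, 1], [4, 1, 0]]
core_weights = [3.0, 2.0, 1.0]

# Test subspaces: one hand-picked line plus random lines and planes.
rng = random.Random(0)
subspaces = [[[4, 1, 0]]]
for k in (1, 2):
    for _ in range(20):
        subspaces.append([[rng.gauss(0, 1) for _ in range(3)]
                          for _ in range(k)])

weighted_ok = is_eps_coreset_on(rows, core_rows, core_weights,
                                subspaces, eps=1e-6)
unweighted_ok = is_eps_coreset_on(rows, core_rows, [1.0, 1.0, 1.0],
                                  subspaces, eps=0.1)
```

In this toy case the weighted subset reproduces the full cost up to rounding, while the same three rows with unit weights violate the bound on the line through (4, 1, 0), which shows why the weights are part of the definition.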
Keywords
Big data, coresets, optimization, streaming algorithms