Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure
Journal of Computational and Graphical Statistics (2024)
Abstract
Big data, with dimension N × P where N is extremely large, has created new
challenges for data analysis, particularly in the realm of creating meaningful
clusters of data. Clustering techniques, such as K-means or hierarchical
clustering, are popular methods for performing exploratory analysis on large
datasets. Unfortunately, these methods are not always possible to apply to big
data due to memory or time constraints generated by calculations of order
P·N(N−1). To circumvent this problem, the clustering technique is typically
applied to a random sample drawn from the dataset; however, a weakness is that
the structure of the dataset, particularly at the edges, is not necessarily
maintained. We propose a new solution through the concept of "data nuggets",
which reduce a large dataset into a small collection of nuggets of data, each
containing a center, weight, and scale parameter. The data nuggets are then
input into algorithms that compute methods such as principal components
analysis and clustering in a more computationally efficient manner. We show the
consistency of the data nuggets-based covariance estimator and apply the
methodology of data nuggets to perform exploratory analysis of a flow cytometry
dataset containing over one million observations using PCA and K-means
clustering for weighted observations. Supplementary materials for this article
are available online.
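As a rough illustration of the idea, not the authors' implementation, the sketch below assumes hypothetical nugget centers and per-nugget weights and runs a weighted covariance/PCA and a weighted K-means with standard NumPy and scikit-learn calls. The paper's covariance estimator also incorporates each nugget's scale parameter, which this sketch omits.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical nugget summaries: M nuggets in P dimensions.
rng = np.random.default_rng(0)
M, P = 500, 10
centers = rng.normal(size=(M, P))                     # nugget centers
weights = rng.integers(1, 50, size=M).astype(float)   # observations per nugget

# Weighted mean and weighted covariance of the nugget centers
# (the paper's estimator additionally uses each nugget's scale parameter).
mean = np.average(centers, axis=0, weights=weights)
cov = np.cov(centers, rowvar=False, aweights=weights)

# PCA via eigendecomposition of the weighted covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
scores = (centers - mean) @ eigvecs[:, order[:2]]     # first two PC scores

# K-means for weighted observations: each nugget center counts
# proportionally to the number of observations it represents.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(centers, sample_weight=weights)
```

The key point is that both steps operate on the M nugget summaries rather than on the N original observations, so their cost scales with the (much smaller) number of nuggets.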