Group Distributionally Robust Dataset Distillation with Risk Minimization
CoRR (2024)
Abstract
Dataset distillation (DD) has emerged as a widely adopted technique for
crafting a synthetic dataset that captures the essential information of a
training dataset, facilitating the training of accurate neural models. Its
applications span various domains, including transfer learning, federated
learning, and neural architecture search. The most popular methods for
constructing the synthetic data rely on matching the convergence properties of
training the model on the synthetic dataset and on the training dataset.
However, matching the training dataset should be regarded as an auxiliary
goal, in the same sense that the training set is only an approximate
substitute for the population distribution, which is the actual object of
interest. Despite the popularity of DD, its relationship to generalization
remains unexplored, particularly across uncommon subgroups. That is, how can we
ensure that a model trained on the synthetic dataset performs well when faced
with samples from regions of low population density? Here, at inference, the
representativeness and coverage of the dataset matter more than the
guaranteed training error. Drawing inspiration from
distributionally robust optimization, we introduce an algorithm that combines
clustering with the minimization of a risk measure on the loss to conduct DD.
We provide a theoretical rationale for our approach and demonstrate its
effective generalization and robustness across subgroups through numerical
experiments.
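
To make the idea of "clustering combined with a risk measure on the loss" concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that subgroups are formed by nearest-center assignment and that the risk measure is the conditional value-at-risk (CVaR) of the per-cluster mean losses; the abstract names neither choice. All function names, the parameter `alpha`, and the toy data are hypothetical.

```python
import math

def nearest_cluster(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))

def groupwise_mean_loss(points, losses, centers):
    """Average the per-sample losses within each cluster; empty clusters are skipped."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for point, loss in zip(points, losses):
        k = nearest_cluster(point, centers)
        sums[k] += loss
        counts[k] += 1
    return [s / n for s, n in zip(sums, counts) if n > 0]

def cvar(group_losses, alpha):
    """Mean of the worst ceil(alpha * K) group losses, for 0 < alpha <= 1."""
    k = max(1, math.ceil(alpha * len(group_losses)))
    worst = sorted(group_losses, reverse=True)[:k]
    return sum(worst) / k

# Toy data (assumed): five samples, three cluster centers, made-up losses.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
losses = [0.2, 0.4, 1.0, 2.0, 3.0]
centers = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]

group_losses = groupwise_mean_loss(points, losses, centers)
robust_loss = cvar(group_losses, alpha=0.5)   # focuses on the worst half of the groups
average_loss = cvar(group_losses, alpha=1.0)  # reduces to the plain mean over groups
```

With `alpha = 1.0` the objective reduces to the ordinary average over groups, while smaller values of `alpha` put more weight on the worst-performing clusters, which is the sense in which such an objective targets low-density subgroups.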