Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching
arXiv (2023)
Abstract
The ultimate goal of Dataset Distillation is to synthesize a small synthetic
dataset such that a model trained on this synthetic set will perform as well
as a model trained on the full, real dataset. Until now, no method of Dataset
Distillation has reached this completely lossless goal, in part because
previous methods remain effective only when the total number of synthetic
samples is extremely small. Since only so much information can be contained in
such a small number of samples, it seems that to achieve truly lossless
dataset distillation, we must develop a distillation method that remains
effective as the size of the synthetic dataset grows. In this work, we present
such an algorithm and elucidate why existing methods fail to generate larger,
high-quality synthetic sets. Current state-of-the-art methods rely on
trajectory matching, i.e., optimizing the synthetic data to induce long-term
training dynamics similar to those of the real data. We empirically find that the
training stage of the trajectories we choose to match (i.e., early or late)
greatly affects the effectiveness of the distilled dataset. Specifically, early
trajectories (where the teacher network learns easy patterns) work well for a
low-cardinality synthetic set, since there are fewer examples across which to
distribute the necessary information. Conversely, late trajectories (where the
teacher network learns hard patterns) provide better signals for larger
synthetic sets since there are now enough samples to represent the necessary
complex patterns. Based on our findings, we propose to align the difficulty of
the generated patterns with the size of the synthetic dataset. In doing so, we
successfully scale trajectory matching-based methods to larger synthetic
datasets, achieving lossless dataset distillation for the very first time. Code
and distilled datasets are available at https://gzyaftermath.github.io/DATM.
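
To make the matching objective concrete, here is a minimal sketch in PyTorch of difficulty-aligned trajectory matching. It is an illustration under stated assumptions, not the authors' released implementation (see the project page above): the helper names choose_window and matching_loss, the window cut-offs, and the flattened parameter vectors standing in for real expert checkpoints are all hypothetical.

import torch

def choose_window(images_per_class: int, num_epochs: int) -> tuple[int, int]:
    # Align the difficulty of the matched expert epochs with the size of the
    # synthetic set: small sets match early (easy-pattern) epochs, while larger
    # sets are allowed to match later (hard-pattern) epochs. The cut-offs below
    # are illustrative placeholders, not the paper's tuned values.
    if images_per_class <= 10:
        return 0, num_epochs // 4
    if images_per_class <= 50:
        return 0, num_epochs // 2
    return num_epochs // 4, num_epochs

def matching_loss(student: torch.Tensor, target: torch.Tensor,
                  start: torch.Tensor) -> torch.Tensor:
    # Trajectory-matching objective in the style of MTT: squared distance
    # between the student's parameters (after a few unrolled steps on the
    # synthetic data, starting from the expert checkpoint `start`) and the
    # expert's parameters a few epochs later, normalized by how far the
    # expert itself moved over the same span.
    return ((student - target) ** 2).sum() / (((start - target) ** 2).sum() + 1e-12)

# Toy usage with random vectors standing in for flattened expert checkpoints.
torch.manual_seed(0)
num_epochs = 50
expert = [torch.randn(128) for _ in range(num_epochs + 1)]  # one checkpoint per epoch
lo, hi = choose_window(images_per_class=50, num_epochs=num_epochs)
t = int(torch.randint(lo, hi - 2, (1,)))                    # start epoch sampled from the window
start, target = expert[t], expert[t + 2]                    # match the expert two epochs ahead
student = start + 0.1 * torch.randn(128)                    # stand-in for the unrolled student
print(float(matching_loss(student, target, start)))

In the full method, student would instead be obtained by unrolling several differentiable training steps on the synthetic data starting from start, and the loss would be backpropagated through that unroll to update the synthetic images themselves.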