Data Subset Selection via Machine Teaching

ICLR 2023(2023)

引用 0|浏览27
暂无评分
摘要
We study the problem of data subset selection: given a fully labeled dataset and a training procedure, select a subset such that training on that subset yields approximately the same test performance as training on the full dataset. We propose an algorithm, inspired by recent work in machine teaching, that has theoretical guarantees, compelling empirical performance, and is model-agnostic meaning the algorithm's only information comes from the predictions of models trained on subsets. Furthermore, we prove lower bounds that show that our algorithm achieves a subset with near-optimal size (under computational hardness assumptions) while training on a number of subsets that is optimal up to extraneous log factors. We then empirically compare our algorithm, machine teaching algorithms, and coreset techniques on six common image datasets with convolutional neural networks. We find that our machine teaching algorithm can find a subset of CIFAR10 of size less than 16k that yields the same performance (5-6% error) as training on the full dataset of size 50k.
更多
查看译文
关键词
Data pruning,data selection,machine teaching
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要