Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress
CoRR (2024)
Abstract
Standardized benchmarks drive progress in machine learning. However, with
repeated testing, the risk of overfitting grows as algorithms over-exploit
benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by
compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. As
exemplars of our approach, we create Lifelong-CIFAR10 and Lifelong-ImageNet,
containing (for now) 1.69M and 1.98M test samples, respectively. While reducing
overfitting, lifelong benchmarks introduce a key challenge: the high cost of
evaluating a growing number of models across an ever-expanding sample set. To
address this challenge, we also introduce an efficient evaluation framework:
Sort & Search (S&S), which reuses previously evaluated models by leveraging
dynamic programming algorithms to selectively rank and sub-select test samples,
enabling cost-effective lifelong benchmarking. Extensive empirical evaluations
across 31,000 models demonstrate that S&S achieves highly efficient approximate
accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours
(a roughly 1000x reduction) on a single A100 GPU, with low approximation error. As such,
lifelong benchmarks offer a robust, practical solution to the "benchmark
exhaustion" problem.
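
To make the "rank, then sub-select" idea concrete, here is a minimal sketch of one way such a scheme could work. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not spell out the procedure, and the function names `sort_samples` and `search_threshold`, along with the single-threshold model of a new model's behavior, are hypothetical. The sketch assumes previously evaluated models provide a 0/1 correctness table, ranks samples from easiest to hardest, and then fits a cut-off to a new model's results with a single running-cost pass.

```python
def sort_samples(correct):
    """Rank samples from easiest to hardest.

    correct[m][s] is 1 if previously evaluated model m got sample s right,
    else 0. (Hypothetical input format, assumed for this sketch.)
    """
    n_models = len(correct)
    n_samples = len(correct[0])
    # Number of earlier models that solved each sample.
    solved = [sum(correct[m][s] for m in range(n_models))
              for s in range(n_samples)]
    # sorted() is stable, so ties keep their original index order.
    return sorted(range(n_samples), key=lambda s: -solved[s])


def search_threshold(observed):
    """Find the cut-off k that best explains a new model's results.

    observed holds 0/1 results of the new model on samples listed in
    easy-to-hard order. The assumed prediction is "first k right, rest
    wrong"; k is chosen to minimise disagreements with observed.
    """
    cost = sum(observed)          # k = 0: every observed 1 is a disagreement
    best_k, best_cost = 0, cost
    for k, y in enumerate(observed, start=1):
        # Moving the cut past a 0 adds a disagreement, past a 1 removes one.
        cost += 1 if y == 0 else -1
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


# Toy example: three earlier models, three samples.
history = [[1, 0, 1],
           [1, 0, 0],
           [1, 1, 1]]
order = sort_samples(history)
k = search_threshold([1, 1, 0, 1, 0, 0])
```

In this sketch the cost saving would come from evaluating the new model on only a subset of the ranked samples and using the fitted cut-off to predict the rest; how the subset is chosen is left open here.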