A Medical Data-Effective Learning Benchmark for Highly Efficient Pre-training of Foundation Models
arXiv (2024)
Abstract
Foundation models, pre-trained on massive datasets, have achieved
unprecedented generalizability. However, is it truly necessary to involve such
vast amounts of data in pre-training, consuming extensive computational
resources? This paper introduces data-effective learning, aiming to use data in
the most impactful way to pre-train foundation models. This involves strategies
that focus on data quality rather than quantity, ensuring the data used for
training has high informational value. Data-effective learning plays a profound
role in accelerating foundation model training, reducing computational costs,
and saving data storage, which is very important as the volume of medical data
in recent years has grown beyond many people's expectations. However, due to
the lack of standards and comprehensive benchmarks, medical data-effective
learning remains poorly studied. To address this gap, our paper
introduces a comprehensive benchmark specifically for evaluating data-effective
learning in the medical field. This benchmark includes a dataset with millions
of data samples from 31 medical centers (DataDEL), a baseline method for
comparison (MedDEL), and a new evaluation metric (NormDEL) to objectively
measure data-effective learning performance. Our extensive experimental results
show the baseline MedDEL can achieve performance comparable to the original
large dataset with only 5
data-effective learning benchmark is crucial for the medical foundation model
research community because it facilitates efficient data use, promotes
collaborative breakthroughs, and fosters the development of cost-effective,
scalable, and impactful healthcare solutions.