Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
arXiv (2024)
Abstract
Pretraining data of large language models comprises multiple domains (e.g.,
web texts, academic papers, code), whose mixture proportions crucially impact
the competence of the resulting models. While existing endeavors rely on
heuristics or qualitative strategies to tune the proportions, we discover the
quantitative predictability of model performance regarding the mixture
proportions in function forms, which we refer to as the data mixing laws.
Fitting such functions on sample mixtures unveils model performance on unseen
mixtures before actual runs, thus guiding the selection of an ideal data
mixture. Furthermore, we propose nested use of the scaling laws of training
steps, model sizes, and our data mixing law to predict the performance of
large models trained on massive data under various mixtures with only
small-scale training. Experimental results verify that our method effectively
optimizes the training mixture of a 1B model trained for 100B tokens on
RedPajama, reaching a performance comparable to that of a model trained for
48% more steps on the default mixture. Extending the application of data
mixing laws to continual training accurately predicts the critical mixture
proportion that avoids catastrophic forgetting and points to the potential of
dynamic data schedules.
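The core procedure described in the abstract, fitting a parametric function of mixture proportions on a few sampled mixtures and then predicting loss on unseen mixtures to pick a good one, can be sketched as follows. This is a minimal illustration, assuming an exponential functional form L(r) = c + k * exp(sum_j t_j * r_j) for a single validation domain; the proportions, loss values, and domain count here are made-up placeholders, not the paper's data or exact parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

def mixing_law(R, c, k, *t):
    """Assumed data mixing law: loss = c + k * exp(R @ t).

    R : (n_samples, n_domains) mixture proportions, each row summing to 1.
    c, k, t_1..t_M : parameters fitted from small-scale training runs.
    """
    return c + k * np.exp(R @ np.asarray(t))

# Sample mixtures over three domains (e.g., web, papers, code) and the
# validation losses measured after small-scale runs (illustrative numbers).
R = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.4, 0.2],
])
losses = np.array([2.95, 2.88, 2.91, 3.05, 2.86])

n_domains = R.shape[1]
p0 = [2.0, 1.0] + [0.0] * n_domains  # initial guess for c, k, t_1..t_M
params, _ = curve_fit(mixing_law, R, losses, p0=p0, maxfev=10000)

# Predict the loss of an unseen mixture before any actual training run.
r_new = np.array([[0.6, 0.3, 0.1]])
print("predicted loss:", mixing_law(r_new, *params))

# Guide mixture selection: score random simplex points and keep the best.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(n_domains), size=1000)
preds = mixing_law(candidates, *params)
best = candidates[np.argmin(preds)]
print("suggested mixture:", best, "predicted loss:", preds.min())
```

In the paper's full pipeline, the losses fed into such a fit would themselves come from small models and short runs, extrapolated to the target scale via the nested scaling laws of training steps and model sizes; the sketch above only covers the mixture-proportion stage.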