Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
CoRR (2024)
Abstract
Language models have become a critical technology for tackling a wide range of
natural language processing tasks, yet many details about how the
best-performing language models were developed are not reported. In particular,
information about their pretraining corpora is seldom discussed: commercial
language models rarely provide any information about their data; even open
models rarely release datasets they are trained on, or an exact recipe to
reproduce them. As a result, it is challenging to conduct certain threads of
language modeling research, such as understanding how training data impacts
model capabilities and shapes their limitations. To facilitate open research on
language model pretraining, we release Dolma, a three-trillion-token English
corpus, built from a diverse mixture of web content, scientific papers, code,
public-domain books, social media, and encyclopedic materials. In addition, we
open-source our data curation toolkit to enable further experimentation and
reproduction of our work. In this report, we document Dolma, including its
design principles, details about its construction, and a summary of its
contents. We interleave this report with analyses and experimental results from
training language models on intermediate states of Dolma to share what we have
learned about important data curation practices, including the role of content
or quality filters, deduplication, and multi-source mixing. Dolma has been used
to train OLMo, a state-of-the-art, open language model and framework designed
to build and study the science of language modeling.