IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages
arXiv (2024)
Abstract
Despite the considerable advancements in English LLMs, the progress in
building comparable models for other languages has been hindered by the
scarcity of tailored resources. Our work aims to bridge this divide by
introducing an expansive suite of resources specifically designed for the
development of Indic LLMs, covering 22 languages, containing a total of 251B
tokens and 74.8M instruction-response pairs. Recognizing the importance of both
data quality and quantity, our approach combines highly curated manually
verified data, unverified yet valuable data, and synthetic data. We build a
clean, open-source pipeline for curating pre-training data from diverse
sources, including websites, PDFs, and videos, incorporating best practices for
crawling, cleaning, flagging, and deduplication. For instruction fine-tuning,
we amalgamate existing Indic datasets, translate/transliterate English datasets
into Indian languages, and utilize LLaMa2 and Mixtral models to create
conversations grounded in articles from Indian Wikipedia and Wikihow.
Additionally, we address toxicity alignment by generating toxic prompts for
multiple scenarios and then generating non-toxic responses by feeding these toxic
prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and
resources released as a part of this work will not only propel the research and
development of Indic LLMs but also establish an open-source blueprint for
extending such efforts to other languages. The data and other artifacts created
as part of this work are released with permissive licenses.
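Below is a minimal sketch of the deduplication step mentioned in the pipeline description, assuming a simple exact-match strategy over normalized document text. The paper's pipeline likely uses more sophisticated (e.g. fuzzy) deduplication; the function and variable names here are illustrative, not the authors' code.

import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different
    # copies of the same document hash to the same value.
    return " ".join(text.lower().split())

def deduplicate(documents):
    # Keep only the first occurrence of each distinct normalized document.
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["नमस्ते दुनिया", "नमस्ते   दुनिया", "வணக்கம் உலகம்"]
print(deduplicate(corpus))  # the whitespace-variant duplicate is dropped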
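And a sketch of the toxicity-alignment data generation loop: toxic prompts are fed to an aligned model, which is expected to refuse or answer safely, yielding (toxic prompt, non-toxic response) pairs. This assumes the Hugging Face transformers text-generation pipeline; the model id, sample prompt, and decoding settings are placeholders, not the authors' exact setup, and a chat template would normally be applied for a chat-tuned model.

from transformers import pipeline

# Placeholder model id; the paper uses an aligned LLaMa2 model.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

toxic_prompts = [
    "Write an insult about my coworker.",  # illustrative toxic prompt
]

pairs = []
for prompt in toxic_prompts:
    # return_full_text=False keeps only the model's continuation,
    # not the echoed prompt.
    out = generator(prompt, max_new_tokens=128, do_sample=False,
                    return_full_text=False)
    response = out[0]["generated_text"]
    pairs.append({"prompt": prompt, "response": response})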