GeoGalactica: A Scientific Large Language Model in Geoscience
arXiv (2023)
Abstract
Large language models (LLMs) have achieved great success thanks to their general
knowledge and their ability to solve a wide spectrum of tasks in natural language
processing (NLP). Because of these abilities, LLMs have opened up potential
interdisciplinary applications that use artificial intelligence to foster
scientific discovery in specific domains (AI for science, AI4S). Meanwhile, the
use of NLP techniques in geoscience research and practice is broad and varied,
ranging from knowledge extraction and document classification to question
answering and knowledge discovery. In this work, we
take an initial step toward leveraging LLMs for science through a rather
straightforward approach. We specialize an LLM for geoscience by
further pre-training the model on a vast amount of geoscience text and then
applying supervised fine-tuning (SFT) to the resulting model with our
custom-collected instruction-tuning dataset. These efforts result in
GeoGalactica, a model with 30 billion parameters. To the best of our knowledge,
it is the largest language model for the geoscience domain. More specifically,
GeoGalactica is obtained by further pre-training Galactica. We train GeoGalactica
on a geoscience-related text corpus containing 65 billion tokens, which stands
as the largest geoscience-specific text corpus. Then we fine-tune the model
with 1 million pairs of instruction-tuning data consisting of questions that
demand professional geoscience knowledge to answer. In this technical report,
we describe in detail all aspects of GeoGalactica, including data
collection, data cleaning, base model selection, pre-training, SFT, and
evaluation. We open-source our data curation tools and the checkpoints of
GeoGalactica from the first three quarters of pre-training.