Geoscience Knowledge Understanding and Utilization via Data-centric Large Language Model

crossref(2024)

Abstract
Large language models (LLMs) have made substantial progress in general natural language processing. GeoLM adapts LLMs to geoscience, with the goal of supporting both research and practical applications in this specialized field. We have developed two models: K2, a 7-billion-parameter LLM trained on a 5.5-billion-token geoscience text corpus that includes over 1 million pieces of geoscience literature, and GeoGalactica, a 30-billion-parameter LLM trained on a 65-billion-token geoscience-related corpus. Supported by the Deep-time Digital Earth (DDE) project, we maintain the largest text corpus specifically designed for geoscience. The effectiveness of LLMs in geoscience depends fundamentally on access to, and deep understanding of, extensive geoscience data, which makes data-centric AI crucial. We propose GeoLM, a framework that addresses the data science challenges of geoscience by integrating techniques such as information extraction, data integration, and data mining. The framework is dedicated to building and applying data-centric geoscience LLMs, so that the wider scientific community can use these models to understand and apply geoscience knowledge more effectively.