Towards Building Multilingual Language Model for Medicine
CoRR (2024)
Abstract
In this paper, we aim to develop an open-source, multilingual language model for medicine that benefits a wider, linguistically diverse audience from different regions. Our contributions are as follows. First, for multilingual medical-specific adaptation, we construct a new multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, which enables auto-regressive training of existing general LLMs. Second, to monitor the development of multilingual LLMs in medicine, we propose a new multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of popular, open-source large language models (LLMs) on our benchmark, along with those further auto-regressively trained on MMedC. As a result, our final model, termed MMedLM 2, with only 7B parameters, achieves superior performance compared to all other open-source models, even rivaling GPT-4 on MMedBench. We will make the resources publicly available, including code, model weights, and datasets.
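The abstract notes that MMedC enables auto-regressive (causal language modeling) training of existing general LLMs. As a rough illustration only, the following is a minimal sketch of such continued pretraining with Hugging Face Transformers; the corpus path, base checkpoint, and hyperparameters are placeholder assumptions, not the paper's actual training setup.

```python
# Minimal sketch of continued auto-regressive (causal LM) pretraining on a
# multilingual medical text corpus. Paths, base model, and hyperparameters
# are illustrative assumptions, not the configuration reported in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder general-purpose base LLM
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Load raw multilingual medical text (one document per line; hypothetical path).
corpus = load_dataset("text", data_files={"train": "mmedc_corpus.txt"})

def tokenize(batch):
    # Truncate each document to the model's context window.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False yields standard next-token (auto-regressive) prediction targets.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="continued-pretrain-medical",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=100,
    save_steps=1000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

In practice, large-scale continued pretraining of this kind would also involve distributed training, careful data mixing across languages, and learning-rate scheduling; the sketch above only shows the basic causal-LM objective applied to domain text.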