Towards building multilingual language model for medicine
Pengcheng Qiu,
Chaoyi Wu,
Xiaoman Zhang,
Weixiong Lin,
Haicheng Wang,
Ya Zhang,
Yanfeng Wang and
Weidi Xie
Additional contact information
Pengcheng Qiu: Shanghai Jiao Tong University
Chaoyi Wu: Shanghai Jiao Tong University
Xiaoman Zhang: Shanghai Jiao Tong University
Weixiong Lin: Shanghai Jiao Tong University
Haicheng Wang: Shanghai Jiao Tong University
Ya Zhang: Shanghai Jiao Tong University
Yanfeng Wang: Shanghai Jiao Tong University
Weidi Xie: Shanghai Jiao Tong University
Nature Communications, 2024, vol. 15, issue 1, 1-15
Abstract:
The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience from different regions. To promote this domain, we make the following contributions: First, we construct a multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, enabling auto-regressive domain adaptation for general LLMs; Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multi-choice question-answering benchmark with rationale, termed MMedBench; Third, we assess a number of open-source large language models (LLMs) on our benchmark, along with those further auto-regressively trained on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, we present a large-scale corpus, a benchmark and a series of models to support the development of multilingual medical LLMs.
Date: 2024
Downloads: (external link)
https://www.nature.com/articles/s41467-024-52417-z Abstract (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:15:y:2024:i:1:d:10.1038_s41467-024-52417-z
Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/
DOI: 10.1038/s41467-024-52417-z
Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.