Performance benchmarking of LLMs on Chinese national medical licensing education: Cross-lingual and question-type effects
Yuxia Tang,
Jian Chen and
Shouju Wang
PLOS ONE, 2026, vol. 21, issue 4, 1-8
Abstract:
Background: Cross-lingual and question-type variations affecting large language model (LLM) accuracy on the Chinese national medical licensing examination remain insufficiently explored.
Methods: In this cross-sectional study (May 13–20, 2025), 396 educational questions (198 English–Chinese pairs) were extracted from the Chinese national medical licensing examination. ChatGPT-4o, ChatGPT-o3, Gemini-2.5-pro, Deepseek-V3, Deepseek-R1, and Doubao-1.5-pro were prompted to provide answers. Responses were compared against reference answers, and accuracy was computed for three question types: basic knowledge (Type A), case analysis (Type B), and integrative judgment (Type C).
Results: Across all question types and languages, Doubao-1.5-pro achieved the highest accuracy at 92.0% ± 1.3%, whereas ChatGPT-4o had the lowest at 82.8% ± 3.7%. There was a significant main effect of question type (P = 0.0038) but no main effect of language (P = 0.56). Post hoc tests confirmed that Type A performance exceeded Types B and C (P
Date: 2026
Downloads:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0346518 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 46518&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0346518
DOI: 10.1371/journal.pone.0346518