Systematic evaluation of the DeepSeek large language model for clinical diagnostic reasoning

Wang, Yang; He, Yang; Qin, Xuchang; Hong, Yucai; Chen, Lin; Zhang, Jing; Ni, Hongying; Zhang, Zhongheng

PLOS ONE, 2026, vol. 21, issue 5, 1-11

Abstract: Background: Artificial intelligence (AI) is undergoing an era of transformative advancement, particularly through the emergence of Transformer-based large language models (LLMs). While these systems demonstrate strong reasoning and generalization capabilities, their clinical applicability, especially in emergency and critical care decision-making, remains underexplored. In time-sensitive settings, diagnostic reasoning must align rigorously with evidence-based standards and ensure the relevance of timing to clinical decisions. Objective: This study aims to provide a preliminary evaluation of the decision-support performance of the DeepSeek model in acute medical scenarios. We systematically evaluate its diagnostic reasoning, temporal consistency of recommendations, and adherence to evidence-based critical care protocols using standardized case-based assessments. Methods: Twenty-nine representative clinical cases were extracted from the Merck Manual of Diagnosis and Therapy, a widely used medical reference providing standardized case descriptions. The model’s outputs were evaluated across four decision-making dimensions: differential diagnosis, diagnostic testing, final diagnosis, and management planning. Human raters scored each response for accuracy, and multivariable linear regression was applied to assess associations between performance and case parameters (age, gender, and Rapid Emergency Medicine Score [REMS]). Results: DeepSeek achieved an overall mean accuracy of 82.9% (95% CI: 80.2–85.6%) across all cases. Accuracy peaked in final diagnosis (97.7%) but declined in differential diagnosis (73.0%). Model performance showed no significant variation across demographic or severity strata. Conclusions: DeepSeek shows promising performance in structured case-based diagnostic tasks, particularly in confirmatory diagnostic reasoning. However, its early-stage reasoning and handling of ambiguous cases require enhancement. Future studies using larger and more diverse clinical datasets are needed to further evaluate the model’s robustness and potential clinical applicability.
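
The abstract reports a mean accuracy with a 95% confidence interval but does not state how the interval was computed. As a rough illustration only, the sketch below computes a mean and a normal-approximation interval from per-case scores. The function name `mean_ci` and the `example_scores` values are hypothetical and are not taken from the study; with only 29 cases, a t-based interval would be somewhat wider than this approximation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(scores, level=0.95):
    """Return (mean, lower, upper) using a normal-approximation interval.

    Assumption: scores are independent per-case accuracy percentages.
    The study's actual interval method is not given in the abstract.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    m = mean(scores)
    # Standard error of the mean from the sample standard deviation.
    se = stdev(scores) / sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% level.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return m, m - z * se, m + z * se


# Hypothetical per-case accuracy percentages, NOT data from the study.
example_scores = [70.0, 80.0, 90.0, 85.0, 75.0]
m, lo, hi = mean_ci(example_scores)
print(f"mean={m:.1f}%, 95% CI: {lo:.1f}-{hi:.1f}%")
```

The same function could be applied separately to the scores for each of the four decision-making dimensions to compare them.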

Date: 2026
Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0346078 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 46078&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0346078

DOI: 10.1371/journal.pone.0346078
