Quantifying the reasoning abilities of LLMs on clinical cases

Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Yanjie Fan, Weike Zhao, Zhuoxia Chen, Hongfei Gu, Chuanjin Peng, Ya Zhang, Yanfeng Wang and Weidi Xie
Additional contact information
Pengcheng Qiu: Shanghai Jiao Tong University
Chaoyi Wu: Shanghai Jiao Tong University
Shuyu Liu: Shanghai Jiao Tong University
Yanjie Fan: Xin Hua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine
Weike Zhao: Shanghai Jiao Tong University
Zhuoxia Chen: China Mobile Communications Group Shanghai Co., Ltd.
Hongfei Gu: China Mobile Communications Group Shanghai Co., Ltd.
Chuanjin Peng: China Mobile Communications Group Shanghai Co., Ltd.
Ya Zhang: Shanghai Jiao Tong University
Yanfeng Wang: Shanghai Jiao Tong University
Weidi Xie: Shanghai Jiao Tong University

Nature Communications, 2025, vol. 16, issue 1, 1-14

Abstract: Recent advances in reasoning-enhanced large language models (LLMs) show promise, yet their application in professional medicine, especially the evaluation of their reasoning process, remains underexplored. We present MedR-Bench, a benchmark of 1453 structured patient cases with reference reasoning derived from clinical case reports, spanning 13 body systems and 10 specialties across common and rare diseases. Our evaluation framework covers three stages of care: examination recommendation, diagnostic decision-making, and treatment planning. To assess reasoning quality, we develop the Reasoning Evaluator, an automated scorer of written reasoning along efficiency, factual accuracy, and completeness. We evaluate seven state-of-the-art reasoning LLMs. Here we show that current models exceed 85% accuracy on simple diagnostic tasks when sufficient examination results are available, but performance drops on examination recommendation and treatment planning. Reasoning is generally factual, yet critical steps are often missing. Open-source models are closing the gap with proprietary systems, highlighting potential for more accessible, equitable clinical AI.

Date: 2025
Downloads: (external link)
https://www.nature.com/articles/s41467-025-64769-1 Abstract (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-64769-1

Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/

DOI: 10.1038/s41467-025-64769-1

Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie

More articles in Nature Communications from Nature
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

Handle: RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-64769-1