Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation
Cheng-Yi Li,
Kao-Jung Chang (),
Cheng-Fu Yang,
Hsin-Yu Wu,
Wenting Chen,
Hritik Bansal,
Ling Chen,
Yi-Ping Yang,
Yu-Chun Chen,
Shih-Pin Chen,
Shih-Jen Chen,
Jiing-Feng Lirng,
Kai-Wei Chang () and
Shih-Hwa Chiou ()
Additional contact information
Cheng-Yi Li: National Yang Ming Chiao Tung University
Kao-Jung Chang: Taipei Veterans General Hospital
Cheng-Fu Yang: University of California
Hsin-Yu Wu: National Yang Ming Chiao Tung University
Wenting Chen: City University of Hong Kong
Hritik Bansal: University of California
Ling Chen: National Yang Ming Chiao Tung University
Yi-Ping Yang: National Yang Ming Chiao Tung University
Yu-Chun Chen: National Yang Ming Chiao Tung University
Shih-Pin Chen: National Yang Ming Chiao Tung University
Shih-Jen Chen: National Yang Ming Chiao Tung University
Jiing-Feng Lirng: National Yang Ming Chiao Tung University
Kai-Wei Chang: University of California
Shih-Hwa Chiou: Taipei Veterans General Hospital
Nature Communications, 2025, vol. 16, issue 1, 1-14
Abstract:
Multi-modal large language models (MLLMs) have transformed the landscape of modern healthcare, with automated radiology report generation (RRG) emerging as a cutting-edge application. While 2D MLLM-based RRG is well established, its utility for 3D medical images remains largely unexplored. In this regard, we curate the 3D-BrainCT dataset (18,885 text-scan pairs) and develop BrainGPT, a clinically visual instruction-tuned (CVIT) model designed for 3D CT RRG. Observing that traditional LLM metrics fail to gauge the diagnostic quality of generated reports, we propose feature-oriented radiology task evaluation (FORTE), an evaluation scheme that captures the clinical essence of the generated reports. Here we show that BrainGPT achieves an average FORTE F1-score of 0.71 (degree = 0.661, landmark = 0.706, feature = 0.693, impression = 0.779), and that 74% of BrainGPT-generated reports were indistinguishable from human-written ground truth in a Turing-like test. Together, our work establishes a comprehensive framework encompassing dataset curation, anatomy-aware model fine-tuning, and the development of robust evaluation metrics for RRG. By sharing our experience in 3D MLLM-based RRG, we aim to accelerate progress in human-machine collaboration for next-generation healthcare.
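For readers unfamiliar with keyword-category scoring, the sketch below illustrates how a FORTE-style, per-category F1 could be computed. It is a minimal, hypothetical example: the toy lexicons, the substring-matching rule, and the function names are assumptions for illustration, not the paper's actual keyword lists or matching procedure.

# Hypothetical sketch of a FORTE-style, category-wise keyword F1 (Python).
# The lexicons below are illustrative placeholders; the paper curates its own
# keyword lists for the degree, landmark, feature, and impression categories.
from typing import Dict, Set

LEXICONS: Dict[str, Set[str]] = {
    "degree": {"mild", "moderate", "severe", "old", "acute"},
    "landmark": {"ventricle", "basal ganglia", "cerebellum", "sulci", "midline"},
    "feature": {"infarct", "hemorrhage", "atrophy", "edema", "midline shift"},
    "impression": {"no acute intracranial lesion", "chronic infarct", "hydrocephalus"},
}

def extract(report: str, lexicon: Set[str]) -> Set[str]:
    """Return the lexicon terms mentioned in a lower-cased report."""
    text = report.lower()
    return {term for term in lexicon if term in text}

def forte_f1(generated: str, reference: str) -> Dict[str, float]:
    """Per-category keyword F1 between a generated and a reference report."""
    scores: Dict[str, float] = {}
    for category, lexicon in LEXICONS.items():
        pred, gold = extract(generated, lexicon), extract(reference, lexicon)
        tp = len(pred & gold)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        denom = precision + recall
        scores[category] = 2 * precision * recall / denom if denom else 0.0
    scores["average"] = sum(scores[c] for c in LEXICONS) / len(LEXICONS)
    return scores

if __name__ == "__main__":
    gen = "Mild atrophy with old infarct in the basal ganglia; no acute intracranial lesion."
    ref = "Old infarct at the basal ganglia. Mild cortical atrophy. No acute intracranial lesion."
    print(forte_f1(gen, ref))

Under this scheme, the reported average of 0.71 is simply the mean of the four category scores: (0.661 + 0.706 + 0.693 + 0.779) / 4 ≈ 0.710.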
Date: 2025
Downloads: https://www.nature.com/articles/s41467-025-57426-0 Abstract (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-57426-0
DOI: 10.1038/s41467-025-57426-0