A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings
Juan Manuel Zambrano Chaves,
Shih-Cheng Huang,
Yanbo Xu,
Hanwen Xu,
Naoto Usuyama,
Sheng Zhang,
Fei Wang,
Yujia Xie,
Mahmoud Khademi,
Ziyi Yang,
Hany Awadalla,
Julia Gong,
Houdong Hu,
Jianwei Yang,
Chunyuan Li,
Jianfeng Gao,
Yu Gu,
Cliff Wong,
Mu Wei,
Tristan Naumann,
Muhao Chen,
Matthew P. Lungren,
Akshay Chaudhari,
Serena Yeung-Levy,
Curtis P. Langlotz,
Sheng Wang and
Hoifung Poon
Additional contact information
Juan Manuel Zambrano Chaves: Microsoft Research
Shih-Cheng Huang: Stanford University
Yanbo Xu: Microsoft Research
Hanwen Xu: University of Washington
Naoto Usuyama: Microsoft Research
Sheng Zhang: Microsoft Research
Fei Wang: University of Southern California
Yujia Xie: Microsoft Research
Mahmoud Khademi: Microsoft Research
Ziyi Yang: Microsoft Research
Hany Awadalla: Microsoft Research
Julia Gong: Microsoft Research
Houdong Hu: Microsoft Research
Jianwei Yang: Microsoft Research
Chunyuan Li: Microsoft Research
Jianfeng Gao: Microsoft Research
Yu Gu: Microsoft Research
Cliff Wong: Microsoft Research
Mu Wei: Microsoft Research
Tristan Naumann: Microsoft Research
Muhao Chen: University of California
Matthew P. Lungren: Microsoft Research
Akshay Chaudhari: Stanford University
Serena Yeung-Levy: Stanford University
Curtis P. Langlotz: Stanford University
Sheng Wang: University of Washington
Hoifung Poon: Microsoft Research
Nature Communications, 2025, vol. 16, issue 1, 1-15
Abstract:
Large foundation models show promise in biomedicine but face challenges in clinical use due to performance gaps, accessibility, cost, and lack of scalable evaluation. Here we show that open-source small multimodal models can bridge these gaps in radiology by generating free-text findings from chest X-ray images. Our data-centric approach leverages 697K curated radiology image-text pairs to train a specialized, domain-adapted chest X-ray encoder. We integrate this encoder with pre-trained language models via a lightweight adapter that aligns image and text modalities. To enable robust, clinically relevant evaluation, we develop and validate CheXprompt, a GPT-4-based metric for assessing factual accuracy aligned with radiologists’ evaluations. Benchmarked with CheXprompt and other standard factuality metrics, LLaVA-Rad (7B) achieves state-of-the-art performance, outperforming much larger models like GPT-4V and Med-PaLM M (84B). While not immediately ready for real-time clinical deployment, LLaVA-Rad is a scalable, privacy-preserving and cost-effective step towards clinically adaptable multimodal AI for radiology.
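The abstract describes coupling a domain-adapted chest X-ray encoder to a pre-trained language model through a lightweight adapter that aligns image and text modalities. The PyTorch sketch below illustrates what such an adapter could look like in general; the module name, MLP design, and feature dimensions are assumptions for illustration only, not the authors' LLaVA-Rad implementation.

    # Illustrative sketch only: a minimal image-to-text adapter of the kind
    # the abstract describes. Dimensions and structure are hypothetical.
    import torch
    import torch.nn as nn

    class ImageTextAdapter(nn.Module):
        """Projects frozen image-encoder features into an LM's embedding space."""
        def __init__(self, vision_dim=1024, lm_dim=4096):
            super().__init__()
            # A small MLP is a common choice for this alignment step.
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, lm_dim),
                nn.GELU(),
                nn.Linear(lm_dim, lm_dim),
            )

        def forward(self, patch_features):
            # patch_features: (batch, num_patches, vision_dim) from the CXR encoder
            return self.proj(patch_features)  # (batch, num_patches, lm_dim)

    # Usage: the projected patch tokens would be prepended to the text-token
    # embeddings before being passed to the pre-trained language model.
    adapter = ImageTextAdapter()
    dummy_patches = torch.randn(1, 196, 1024)  # hypothetical encoder output
    image_tokens = adapter(dummy_patches)
    print(image_tokens.shape)  # torch.Size([1, 196, 4096])

In adapter-style designs like this, only the projection layers need to be trained, which keeps the cost of aligning the image encoder with the language model low relative to full fine-tuning.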
Date: 2025
Downloads: https://www.nature.com/articles/s41467-025-58344-x (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:16:y:2025:i:1:d:10.1038_s41467-025-58344-x
DOI: 10.1038/s41467-025-58344-x