A vision–language foundation model for precision oncology
Jinxi Xiang,
Xiyue Wang,
Xiaoming Zhang,
Yinghua Xi,
Feyisope Eweje,
Yijiang Chen,
Yuchen Li,
Colin Bergstrom,
Matthew Gopaulchan,
Ted Kim,
Kun-Hsing Yu,
Sierra Willens,
Francesca Maria Olguin,
Jeffrey J. Nirschl,
Joel Neal,
Maximilian Diehn,
Sen Yang and
Ruijiang Li
Author affiliations:
Jinxi Xiang: Stanford University School of Medicine
Xiyue Wang: Stanford University School of Medicine
Xiaoming Zhang: Stanford University School of Medicine
Yinghua Xi: Stanford University School of Medicine
Feyisope Eweje: Stanford University School of Medicine
Yijiang Chen: Stanford University School of Medicine
Yuchen Li: Stanford University School of Medicine
Colin Bergstrom: Stanford University School of Medicine
Matthew Gopaulchan: Stanford University School of Medicine
Ted Kim: Stanford University School of Medicine
Kun-Hsing Yu: Harvard Medical School
Sierra Willens: Stanford University School of Medicine
Francesca Maria Olguin: Stanford University School of Medicine
Jeffrey J. Nirschl: Stanford University School of Medicine
Joel Neal: Stanford University School of Medicine
Maximilian Diehn: Stanford University School of Medicine
Sen Yang: Stanford University School of Medicine
Ruijiang Li: Stanford University School of Medicine
Nature, 2025, vol. 638, issue 8051, 769-778
Abstract:
Clinical decision-making is driven by multimodal data, including clinical notes and pathological characteristics. Artificial intelligence approaches that can effectively integrate multimodal data hold significant promise in advancing clinical care1,2. However, the scarcity of well-annotated multimodal datasets in clinical settings has hindered the development of useful models. In this study, we developed the Multimodal transformer with Unified maSKed modeling (MUSK), a vision–language foundation model designed to leverage large-scale, unlabelled, unpaired image and text data. MUSK was pretrained on 50 million pathology images from 11,577 patients and one billion pathology-related text tokens using unified masked modelling. It was further pretrained on one million pathology image–text pairs to efficiently align the vision and language features. With minimal or no further training, MUSK was tested in a wide range of applications and demonstrated superior performance across 23 patch-level and slide-level benchmarks, including image-to-text and text-to-image retrieval, visual question answering, image classification and molecular biomarker prediction. Furthermore, MUSK showed strong performance in outcome prediction, including melanoma relapse prediction, pan-cancer prognosis prediction and immunotherapy response prediction in lung and gastro-oesophageal cancers. MUSK effectively combined complementary information from pathology images and clinical reports and could potentially improve diagnosis and precision in cancer therapy.
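The abstract describes a two-stage pretraining recipe: unified masked modelling on unpaired images and text, followed by alignment pretraining on image–text pairs. The sketch below illustrates that recipe in miniature, assuming toy two-layer encoders, discrete image tokens from a pretrained tokenizer, a standard symmetric InfoNCE alignment loss, and made-up vocabulary and codebook sizes; it is an illustration of the general approach, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMaskedEncoder(nn.Module):
    """Stand-in for one branch of a shared vision-language transformer."""
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)  # reconstructs masked tokens

    def forward(self, tokens: torch.Tensor):
        h = self.encoder(self.embed(tokens))  # (batch, length, dim)
        return h, self.head(h)                # features, token logits

def masked_modeling_loss(model, tokens, mask_id: int, p: float = 0.15):
    """Stage 1: masked modelling on unpaired data. Images are assumed to be
    pre-tokenized into discrete codes; text uses ordinary subword tokens."""
    mask = torch.rand(tokens.shape, device=tokens.device) < p
    _, logits = model(tokens.masked_fill(mask, mask_id))
    return F.cross_entropy(logits[mask], tokens[mask])

def alignment_loss(img_feat, txt_feat, temperature: float = 0.07):
    """Stage 2: symmetric InfoNCE on paired image-text features."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Smoke test with random tokens (shapes only; the paper's scale is 50 million
# pathology images, ~1 billion text tokens, then 1 million image-text pairs).
vision = ToyMaskedEncoder(vocab_size=8192)   # 8192 = assumed image codebook size
text = ToyMaskedEncoder(vocab_size=30522)    # 30522 = assumed subword vocabulary
img_tokens = torch.randint(0, 8191, (4, 64))
txt_tokens = torch.randint(0, 30521, (4, 32))
stage1 = (masked_modeling_loss(vision, img_tokens, mask_id=8191) +
          masked_modeling_loss(text, txt_tokens, mask_id=30521))
img_h, _ = vision(img_tokens)
txt_h, _ = text(txt_tokens)
stage2 = alignment_loss(img_h.mean(dim=1), txt_h.mean(dim=1))

The point of the two-stage ordering is data efficiency: masked modelling needs no pairing, so each modality can exploit abundant unlabelled data before the comparatively scarce paired data is spent on cross-modal alignment.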
Date: 2025
Downloads: https://www.nature.com/articles/s41586-024-08378-w (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:nat:nature:v:638:y:2025:i:8051:d:10.1038_s41586-024-08378-w
DOI: 10.1038/s41586-024-08378-w