A vision-language foundation model for precision oncology

Nature. 2025 Jan 8. doi: 10.1038/s41586-024-08378-w. Online ahead of print.

Abstract

Clinical decision-making is driven by multimodal data, including clinical notes and pathological characteristics. Artificial intelligence approaches that can effectively integrate multimodal data hold significant promise in advancing clinical care [1,2]. However, the scarcity of well-annotated multimodal datasets in clinical settings has hindered the development of useful models. In this study, we developed the Multimodal transformer with Unified maSKed modeling (MUSK), a vision-language foundation model designed to leverage large-scale, unlabelled, unpaired image and text data. MUSK was pretrained on 50 million pathology images from 11,577 patients and one billion pathology-related text tokens using unified masked modelling. It was further pretrained on one million pathology image-text pairs to efficiently align the vision and language features. With minimal or no further training, MUSK was tested in a wide range of applications and demonstrated superior performance across 23 patch-level and slide-level benchmarks, including image-to-text and text-to-image retrieval, visual question answering, image classification and molecular biomarker prediction. Furthermore, MUSK showed strong performance in outcome prediction, including melanoma relapse prediction, pan-cancer prognosis prediction and immunotherapy response prediction in lung and gastro-oesophageal cancers. MUSK effectively combined complementary information from pathology images and clinical reports and could potentially improve diagnosis and precision in cancer therapy.
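To make the two-stage pretraining recipe named above concrete, the following is a minimal, hypothetical PyTorch sketch of (1) masked modelling on unpaired tokens and (2) symmetric contrastive image-text alignment. The class ToyEncoder, the loss functions, and all dimensions and hyperparameters here are illustrative assumptions for exposition only; they are not MUSK's published architecture or training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1000, 256  # toy vocabulary of visual/text tokens, embedding width

class ToyEncoder(nn.Module):
    """Stand-in for a multimodal transformer encoder (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)  # +1 reserves a [MASK] id
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)  # predicts the original token ids

    def forward(self, tokens):
        return self.encoder(self.embed(tokens))

def masked_modelling_loss(model, tokens, mask_id=VOCAB, p=0.15):
    """Stage 1: mask random tokens and train the model to reconstruct them."""
    mask = torch.rand(tokens.shape) < p
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model.head(model(corrupted))
    return F.cross_entropy(logits[mask], tokens[mask])

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Stage 2: symmetric InfoNCE; matched image-text pairs lie on the diagonal."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

model = ToyEncoder()
tokens = torch.randint(0, VOCAB, (8, 32))               # unpaired token batch
loss1 = masked_modelling_loss(model, tokens)
img = model(torch.randint(0, VOCAB, (8, 32))).mean(1)   # pooled image features
txt = model(torch.randint(0, VOCAB, (8, 32))).mean(1)   # pooled text features
loss2 = alignment_loss(img, txt)
(loss1 + loss2).backward()

The ordering reflects the rationale stated in the abstract: masked modelling can exploit abundant unpaired images and text, while the scarcer paired data are reserved for the cheaper alignment stage.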