Development of a large-scale medical visual question-answering dataset

Xiaoman Zhang; Chaoyi Wu; Ziheng Zhao; Weixiong Lin; Ya Zhang; Yanfeng Wang; Weidi Xie

doi:10.1038/s43856-024-00709-2

Commun Med (Lond). 2024 Dec 21;4(1):277. doi: 10.1038/s43856-024-00709-2.

Authors

Xiaoman Zhang^#¹,², Chaoyi Wu^#¹,², Ziheng Zhao¹,², Weixiong Lin¹,², Ya Zhang¹,², Yanfeng Wang³,⁴, Weidi Xie⁵,⁶

Affiliations

¹ Shanghai Jiao Tong University, Shanghai, China.
² Shanghai Artificial Intelligence Laboratory, Shanghai, China.
³ Shanghai Jiao Tong University, Shanghai, China. wangyanfeng622@sjtu.edu.cn.
⁴ Shanghai Artificial Intelligence Laboratory, Shanghai, China. wangyanfeng622@sjtu.edu.cn.
⁵ Shanghai Jiao Tong University, Shanghai, China. weidi@sjtu.edu.cn.
⁶ Shanghai Artificial Intelligence Laboratory, Shanghai, China. weidi@sjtu.edu.cn.

^# Contributed equally.

PMID: 39709495
DOI: 10.1038/s43856-024-00709-2

Abstract

Background: Medical Visual Question Answering (MedVQA) enhances diagnostic accuracy and healthcare delivery by leveraging artificial intelligence to interpret medical images. This study aims to redefine MedVQA as a generation task that mirrors human-machine interaction and to develop a model capable of integrating complex visual and textual information.

Methods: We constructed a large-scale medical visual question-answering dataset, PMC-VQA, containing 227,000 VQA pairs across 149,000 images that span various modalities and diseases. We introduced a generative model that aligns visual information from a pre-trained vision encoder with a large language model. This model was initially trained on PMC-VQA and subsequently fine-tuned on multiple public benchmarks.

Results: Here, we show that our model significantly outperforms existing MedVQA models in generating relevant, accurate free-form answers. We also propose a manually verified test set that presents a greater challenge and serves as a robust measure to monitor the advancement of generative MedVQA methods.

Conclusions: The PMC-VQA dataset proves to be an essential resource for the research community, and our model marks a significant breakthrough in MedVQA. We maintain a leaderboard to facilitate comprehensive evaluation and comparison, providing a centralized resource for benchmarking state-of-the-art approaches.

Plain language summary

Medical images play a crucial role in healthcare, but interpreting them accurately can be challenging. This study developed an artificial intelligence system that can answer questions about medical images, similar to how a medical expert would explain findings to patients. We created a large collection of medical images paired with questions and answers to train our AI system, covering various types of medical scans and conditions. Our system can generate detailed, accurate responses to questions about medical images, performing better than existing approaches. The system and dataset we developed are freely available to researchers, which should help advance the field of medical image interpretation and ultimately improve healthcare delivery.