Evaluating the effectiveness of advanced large language models in medical knowledge: A comparative study using Japanese national medical examination

Mingxin Liu; Tsuyoshi Okuhara; Zhehao Dai; Wenbo Huang; Lin Gu; Hiroko Okada; Emi Furukawa; Takahiro Kiuchi

doi:10.1016/j.ijmedinf.2024.105673

Int J Med Inform. 2025 Jan:193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.

Authors

Mingxin Liu¹, Tsuyoshi Okuhara², Zhehao Dai³, Wenbo Huang⁴, Lin Gu⁵, Hiroko Okada⁶, Emi Furukawa⁷, Takahiro Kiuchi⁸

Affiliations

¹ Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. Electronic address: liumingxin98@g.ecc.u-tokyo.ac.jp.
² Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. Electronic address: okuhara.hc@gmail.com.
³ Department of Cardiovascular Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. Electronic address: daizh@luke.ac.jp.
⁴ Department of Clinical Epidemiology and Health Economics, School of Public Health, The University of Tokyo, Tokyo, Japan. Electronic address: wenbohuang2020@gmail.com.
⁵ Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan. Electronic address: Lin.gu@riken.jp.
⁶ Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. Electronic address: sakura.hiro1119@gmail.com.
⁷ Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. Electronic address: efurukawa-tho@umin.ac.jp.
⁸ Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. Electronic address: kiuchi8818@gmail.com.

PMID: 39471700
DOI: 10.1016/j.ijmedinf.2024.105673

Abstract

Study aims and objectives: This study evaluates the accuracy of medical knowledge in the most advanced LLMs as of 2024 (GPT-4o, GPT-4, Gemini 1.5 Pro, and Claude 3 Opus). It is the first to evaluate these LLMs using a non-English medical licensing exam. The insights from this study will guide educators, policymakers, and technical experts in the effective use of AI in medical education and clinical diagnosis.

Method: The authors entered 790 questions from the Japanese National Medical Examination into the chat windows of the LLMs to obtain responses. Two authors independently assessed the correctness of each response. The authors analyzed the overall accuracy rates of the LLMs and compared their performance on image and non-image questions, questions of varying difficulty levels, general and clinical questions, and questions from different medical specialties. Additionally, the authors examined the correlation between the number of publications and the LLMs' performance in different medical specialties.
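The category-wise comparison described in the Method can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the record layout, the field names (`has_image`, `correct`), and the four sample records are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group value: fraction of correct answers} for graded records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        if rec["correct"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical graded responses for one model (illustrative only, not study data).
records = [
    {"has_image": False, "correct": True},
    {"has_image": False, "correct": True},
    {"has_image": True, "correct": False},
    {"has_image": True, "correct": True},
]

by_image = accuracy_by_group(records, "has_image")
overall = sum(rec["correct"] for rec in records) / len(records)
```

The same helper could be called with a different key (for example a difficulty level or a specialty field, if such fields were recorded) to obtain the other breakdowns mentioned in the Method.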

Results: GPT-4o achieved the highest accuracy rate, 89.2%, and outperformed the other LLMs both overall and in each specific category. All four LLMs performed better on non-image questions than on image questions, with an accuracy gap of 10%. They also performed better on easy questions than on normal and difficult ones. GPT-4o achieved a 95.0% accuracy rate on easy questions, marking it as an effective knowledge source for medical education. All four LLMs performed worst on the "Gastroenterology and Hepatology" specialty. There was a positive correlation between the number of publications and LLM performance in different specialties.
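The abstract does not state which correlation coefficient was used. As an illustration only, the sketch below computes a Pearson coefficient between per-specialty publication counts and accuracy rates; the choice of Pearson, the function name, and all numbers are assumptions, not values from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-specialty values (illustrative only, not study data).
publications = [100, 200, 300, 400]
accuracy = [0.70, 0.75, 0.85, 0.90]

r = pearson_r(publications, accuracy)
```

A value of `r` above zero corresponds to the kind of positive association reported in the Results; a rank-based coefficient such as Spearman's would be a reasonable alternative if the relationship is monotonic but not linear.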

Conclusions: GPT-4o achieved an overall accuracy rate close to 90%, with 95.0% on easy questions, significantly outperforming the other LLMs. This indicates GPT-4o's potential as a knowledge source for easy questions. The presence of images and question difficulty both significantly affect LLM accuracy. "Gastroenterology and Hepatology" is the specialty with the lowest performance. The LLMs' performance across medical specialties correlates positively with the number of related publications.

Keywords: Artificial intelligence; ChatGPT; Large language models; Medical education; Medical licensing examination.

Publication types

Comparative Study

MeSH terms

Artificial Intelligence*
Clinical Competence
Education, Medical*
Educational Measurement* / methods
Japan
Licensure, Medical