Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam

Cureus. 2024 Sep 27;16(9):e70302. doi: 10.7759/cureus.70302. eCollection 2024 Sep.

Abstract

Purpose: Large language models (LLMs) are increasingly employed across various fields, including medicine and dentistry. In dental anesthesiology, LLMs are expected to improve the efficiency of information gathering as well as patient outcomes and education. This study evaluates the performance of different LLMs in answering questions from the Japanese Dental Society of Anesthesiology Board Certification Examination (JDSABCE) to determine their utility in dental anesthesiology.

Methods: The study assessed three LLMs, ChatGPT-4 (OpenAI, San Francisco, California, United States), Gemini 1.0 (Google, Mountain View, California, United States), and Claude 3 Opus (Anthropic, San Francisco, California, United States), using multiple-choice questions from the 2020 to 2022 JDSABCE exams. Each LLM answered the questions three times; questions involving figures or deemed inappropriate were excluded. The primary outcome was the accuracy rate of each LLM, with secondary analysis of six subgroups: (1) basic physiology necessary for general anesthesia, (2) local anesthesia, (3) sedation and general anesthesia, (4) diseases and patient management methods that pose challenges in systemic management, (5) pain management, and (6) shock and cardiopulmonary resuscitation. Statistical analysis was performed using one-way ANOVA with Dunnett's multiple comparisons test and a significance threshold of p<0.05 (a sketch of this analysis appears below).

Results: ChatGPT-4 achieved a correct answer rate of 51.2% (95% CI: 42.78-60.56; p=0.003) and Claude 3 Opus 47.4% (95% CI: 43.45-51.44; p<0.001), both significantly higher than Gemini 1.0's 30.3% (95% CI: 26.53-34.14). In subgroup analyses, ChatGPT-4 and Claude 3 Opus outperformed Gemini 1.0 in basic physiology, sedation and general anesthesia, and systemic management challenges. Notably, ChatGPT-4 performed best on questions related to systemic management (62.5%) and Claude 3 Opus on pain management (61.53%).

Conclusions: ChatGPT-4 and Claude 3 Opus show potential for use in dental anesthesiology, outperforming Gemini 1.0; however, their current accuracy rates are insufficient for reliable clinical use. These findings have implications for dental anesthesiology practice and education, including educational support, clinical decision support, and continuing education. To enhance LLM utility in dental anesthesiology, it is crucial to increase the availability of high-quality information online and to refine prompt engineering to better guide LLM responses.
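The following minimal sketch illustrates the kind of analysis described in the Methods, assuming the comparison is run in Python with SciPy 1.11 or later (which provides scipy.stats.dunnett); the per-run accuracy values are hypothetical placeholders, not the study's data.

    # One-way ANOVA followed by Dunnett's multiple comparisons against
    # Gemini 1.0 as the reference, mirroring the reported analysis.
    # NOTE: the per-run accuracy values below are hypothetical, not study data.
    from scipy import stats

    chatgpt4 = [0.52, 0.50, 0.51]  # hypothetical accuracy across 3 runs
    claude3 = [0.48, 0.47, 0.47]
    gemini1 = [0.31, 0.30, 0.30]

    # Omnibus test: do mean accuracies differ across the three models?
    f_stat, p_anova = stats.f_oneway(chatgpt4, claude3, gemini1)
    print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

    # Dunnett's test: each model versus the Gemini 1.0 control
    result = stats.dunnett(chatgpt4, claude3, control=gemini1)
    for name, p in zip(["ChatGPT-4", "Claude 3 Opus"], result.pvalue):
        print(f"{name} vs. Gemini 1.0: p={p:.4f}, significant at p<0.05: {p < 0.05}")

Dunnett's procedure is appropriate here because it compares each model against a single designated control (Gemini 1.0) while controlling the family-wise error rate, rather than testing all pairwise differences.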

Keywords: chatgpt; claude; dental anesthesiology; gemini; generative artificial intelligence; large language models.