Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam

Cureus. 2024 Sep 27;16(9):e70302. doi: 10.7759/cureus.70302. eCollection 2024 Sep.

Abstract

Purpose: Large language models (LLMs) are increasingly employed across various fields, including medicine and dentistry. In dental anesthesiology, LLMs are expected to improve the efficiency of information gathering as well as patient outcomes and education. This study evaluates the performance of different LLMs in answering questions from the Japanese Dental Society of Anesthesiology Board Certification Examination (JDSABCE) to determine their utility in dental anesthesiology.

Methods: The study assessed three LLMs, ChatGPT-4 (OpenAI, San Francisco, California, United States), Gemini 1.0 (Google, Mountain View, California, United States), and Claude 3 Opus (Anthropic, San Francisco, California, United States), using multiple-choice questions from the 2020 to 2022 JDSABCE exams. Each LLM answered the questions three times; questions involving figures or deemed inappropriate were excluded. The primary outcome was the accuracy rate of each LLM, with secondary analysis of six subgroups: (1) basic physiology necessary for general anesthesia, (2) local anesthesia, (3) sedation and general anesthesia, (4) diseases and patient management methods that pose challenges in systemic management, (5) pain management, and (6) shock and cardiopulmonary resuscitation. Statistical analysis was performed using one-way ANOVA with Dunnett's multiple comparisons test and a significance threshold of p<0.05 (a sketch of this analysis appears below).

Results: ChatGPT-4 achieved a correct answer rate of 51.2% (95% CI: 42.78-60.56; p=0.003) and Claude 3 Opus 47.4% (95% CI: 43.45-51.44; p<0.001), both significantly higher than Gemini 1.0's 30.3% (95% CI: 26.53-34.14). In subgroup analyses, ChatGPT-4 and Claude 3 Opus outperformed Gemini 1.0 in basic physiology, sedation and general anesthesia, and systemic management challenges. Notably, ChatGPT-4 performed best on questions related to systemic management (62.5%) and Claude 3 Opus on pain management (61.53%).

Conclusions: ChatGPT-4 and Claude 3 Opus show potential for use in dental anesthesiology, outperforming Gemini 1.0; however, their current accuracy rates are insufficient for reliable clinical use. These findings have implications for dental anesthesiology practice and education, including educational support, clinical decision support, and continuing education. To enhance LLM utility in dental anesthesiology, it is crucial to increase the availability of high-quality information online and to refine prompt engineering to better guide LLM responses.
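The following minimal sketch illustrates the kind of analysis described in the Methods, assuming the comparison is run in Python with SciPy 1.11 or later (which provides scipy.stats.dunnett); the per-run accuracy values are hypothetical placeholders, not the study's data.

    # One-way ANOVA followed by Dunnett's multiple comparisons against
    # Gemini 1.0 as the reference, mirroring the reported analysis.
    # NOTE: the per-run accuracy values below are hypothetical, not study data.
    from scipy import stats

    chatgpt4 = [0.52, 0.50, 0.51]  # hypothetical accuracy across 3 runs
    claude3 = [0.48, 0.47, 0.47]
    gemini1 = [0.31, 0.30, 0.30]

    # Omnibus test: do mean accuracies differ across the three models?
    f_stat, p_anova = stats.f_oneway(chatgpt4, claude3, gemini1)
    print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

    # Dunnett's test: each model versus the Gemini 1.0 control
    result = stats.dunnett(chatgpt4, claude3, control=gemini1)
    for name, p in zip(["ChatGPT-4", "Claude 3 Opus"], result.pvalue):
        print(f"{name} vs. Gemini 1.0: p={p:.4f}, significant at p<0.05: {p < 0.05}")

Dunnett's procedure is appropriate here because it compares each model against a single designated control (Gemini 1.0) while controlling the family-wise error rate, rather than testing all pairwise differences.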

Keywords: chatgpt; claude; dental anesthesiology; gemini; generative artificial intelligence; large language models.