Background: Chatbots have gained a presence in medicine and surgery and have been proposed as aids to clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been called into question. We evaluated the performance of 4 popular chatbots on a board-style examination and compared the results with those of a group of board-certified thoracic surgeons.
Methods: Clinical scenarios were developed within domains based on the American Board of Thoracic Surgery (ABTS) Qualifying Exam. Each scenario included 3 stems, written with the Key Feature methodology, related to diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as to randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were identified during exam development; if one occurred in any of the 3 stems, the entire scenario received a score of 0. The Mann-Whitney U test was used to compare surgeon scores and chatbot scores.
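The scoring rule and the comparison statistic described in the Methods can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: 1 point is awarded per correct stem (the abstract gives only the 3-point maximum), ties in the U statistic are counted as 0.5, and the function names `score_scenario` and `mann_whitney_u` are hypothetical. It is not the authors' analysis code and computes only the U statistic, not a P value.

```python
from typing import Sequence


def score_scenario(stem_correct: Sequence[bool],
                   critical_failure: Sequence[bool]) -> int:
    """Score one scenario with 3 stems.

    Assumption: each correct stem is worth 1 point (maximum 3).
    A critical failure in any stem zeroes the whole scenario.
    """
    if len(stem_correct) != 3 or len(critical_failure) != 3:
        raise ValueError("each scenario is expected to have exactly 3 stems")
    if any(critical_failure):
        return 0
    return sum(1 for ok in stem_correct if ok)


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """Mann-Whitney U statistic for sample x against sample y.

    Counts the pairs (xi, yj) with xi > yj; a tie contributes 0.5.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u
```

For example, `score_scenario([True, True, True], [False, True, False])` returns 0 because one stem contains a critical failure, even though all 3 stems were otherwise answered correctly. In practice a library routine such as `scipy.stats.mannwhitneyu` would be used to obtain the P value.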
Results: Examinations were completed by 21 surgeons, the majority of whom (n = 14; 66%) practiced in academic or university settings. The median score per scenario was 1.06 for chatbots, compared with 1.88 for surgeons (difference, 0.66; P = .019). Surgeon median scores were higher than chatbot median scores for all but 2 scenarios. Chatbot answers were significantly more likely than surgeon answers to be deemed critical failures (median, 0.50 per chatbot/scenario vs 0.19 per surgeon/scenario; P = .016).
Conclusions: Four popular chatbots performed at a significantly lower level than board-certified surgeons. Implementation of AI should be undertaken with caution in clinical decision making.
Keywords: AI; artificial intelligence; chatbot; education; key features; surgery; thoracic surgery.
Copyright © 2024. Published by Elsevier Inc.