How soon will surgeons become mere technicians? Chatbot performance in managing clinical scenarios

Darren S Bryan; Joseph J Platz; Keith S Naunheim; Mark K Ferguson; Research in Artificial Intelligence Development for Surgery (RAIDS) Working Group

doi:10.1016/j.jtcvs.2024.11.006

J Thorac Cardiovasc Surg. 2024 Nov 12:S0022-5223(24)01034-1. doi: 10.1016/j.jtcvs.2024.11.006. Online ahead of print.

Authors

Darren S Bryan¹, Joseph J Platz², Keith S Naunheim², Mark K Ferguson³; Research in Artificial Intelligence Development for Surgery (RAIDS) Working Group

Collaborators

Research in Artificial Intelligence Development for Surgery (RAIDS) Working Group:
Ghulam Abbas, Mara Antonoff, Sharon Ben-Or, Caitlin Demarest, David Finley, Robert Cameron, John Kuckelman, Svetlana Kotova, Ian Makey, Meredith Harrison, Philip Linden, Alexander Leung, Shari Meyerson, Daniel Miller, G Darby Pope, Daniel Raymond, Uma Sachdeva, Desiree Steimer, Eric Toloza, Ruchi Thanawala, Brian Whang

Affiliations

¹ Department of Surgery, University of Chicago, Chicago, Ill. Electronic address: dbryan@uchicago.edu.
² Department of Surgery, St Louis University, St Louis, Mo.
³ Department of Surgery, University of Chicago, Chicago, Ill.

PMID: 39536965

Abstract

Background: Chatbots have established a presence in medicine and surgery and have been proposed as aids to clinical decision making. However, the accuracy of information provided by artificial intelligence (AI) platforms has been called into question. We evaluated the performance of 4 popular chatbots on a board-style examination and compared the results with those of a group of board-certified thoracic surgeons.

Methods: Clinical scenarios were developed within domains based on the American Board of Thoracic Surgery (ABTS) Qualifying Exam. Each scenario included 3 stems, written with the Key Feature methodology, that addressed diagnosis, evaluation, and treatment. Ten scenarios were presented to ChatGPT-4, Bard (now Gemini), Perplexity, and Claude 2, as well as to randomly selected ABTS-certified surgeons. The maximum possible score was 3 points per scenario. Critical failures were defined during exam development; if one occurred in any of the 3 stems, the entire scenario received a score of 0. The Mann-Whitney U test was used to compare surgeon scores with chatbot scores.
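The scoring rule and the rank-based comparison described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes each of the 3 stems is worth up to 1 point (the abstract states only the 3-point maximum per scenario). The function names `score_scenario` and `mann_whitney_u` and all example values are hypothetical. The second function computes only the U statistic by direct pair counting; obtaining a P value would require a statistics package.

```python
from typing import Sequence


def score_scenario(stem_points: Sequence[float],
                   critical_failure: Sequence[bool]) -> float:
    """Score one scenario made of 3 stems.

    Assumption: each stem is worth up to 1 point, for a maximum of 3.
    A critical failure in any stem zeroes the whole scenario.
    """
    if len(stem_points) != 3 or len(critical_failure) != 3:
        raise ValueError("each scenario has exactly 3 stems")
    if any(critical_failure):
        return 0.0
    return float(sum(stem_points))


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the Mann-Whitney U statistic for sample x against sample y.

    Counts pairs (xi, yj) with xi > yj; a tie contributes 0.5.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical example values, not data from the study.
chatbot_scores = [
    score_scenario([1, 1, 0], [False, False, False]),  # no critical failure
    score_scenario([1, 1, 1], [False, True, False]),   # critical failure in stem 2
]
surgeon_scores = [3.0, 2.0]

u_chatbot = mann_whitney_u(chatbot_scores, surgeon_scores)
u_surgeon = mann_whitney_u(surgeon_scores, chatbot_scores)
print(chatbot_scores, u_chatbot, u_surgeon)
```

In this sketch a scenario answered perfectly on all 3 stems still scores 0 if any stem triggers a critical failure, which is why critical failures weigh so heavily on the per-scenario totals. The two U values always sum to the product of the two sample sizes, which gives a quick sanity check on the pair counting.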

Results: Examinations were completed by 21 surgeons, the majority of whom (n = 14; 66%) practiced in academic or university settings. The median score per scenario was 1.06 for chatbots, compared to 1.88 for surgeons (difference, 0.66; P = .019). Surgeon median scores were higher than chatbot median scores in all but 2 scenarios. Chatbot answers were significantly more likely than surgeon answers to be deemed critical failures (median, 0.50 per chatbot/scenario vs 0.19 per surgeon/scenario; P = .016).

Conclusions: Four popular chatbots performed at a significantly lower level than board-certified surgeons. Implementation of AI should be undertaken with caution in clinical decision making.

Keywords: AI; artificial intelligence; chatbot; education; key features; surgery; thoracic surgery.