Artificial Intelligence Large Language Models Address Anterior Cruciate Ligament Reconstruction: Superior Clarity and Completeness by Gemini Compared With ChatGPT-4 in Response to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines

Arthroscopy. 2024 Sep 21:S0749-8063(24)00736-9. doi: 10.1016/j.arthro.2024.09.020. Online ahead of print.

Abstract

Purpose: To assess the ability of ChatGPT-4 and Gemini to generate accurate and relevant responses to prompts based on the 2022 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPG) for anterior cruciate ligament reconstruction (ACLR).

Methods: Responses from ChatGPT-4 and Gemini to prompts derived from all 15 AAOS guidelines were evaluated by 7 fellowship-trained orthopaedic sports medicine surgeons using a structured questionnaire assessing 5 key characteristics on a scale from 1 to 5. The prompts were categorized into 3 areas: diagnosis and preoperative management, surgical timing and technique, and rehabilitation and prevention. Statistical analysis included calculation of mean scores and standard deviations, with 2-sided t tests used to compare performance between the 2 large language models (LLMs). Scores were then evaluated for inter-rater reliability (IRR).
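The comparison described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis. The ratings below are hypothetical, the function names are invented for illustration, and the abstract does not state whether the t tests were paired or unpaired, so the unpaired Welch form is assumed here. The P value uses a normal approximation to the t distribution, which is rough for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of 1-5 ratings."""
    return mean(scores), stdev(scores)


def welch_t(a, b):
    """Welch's unequal-variance t statistic for two independent samples.

    Assumption: the abstract does not say which t test variant was used.
    """
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    standard_error = sqrt(sd_a**2 / len(a) + sd_b**2 / len(b))
    return (mean_a - mean_b) / standard_error


def two_sided_p_normal_approx(t):
    """Two-sided P value from a normal approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical clarity ratings on the 1-5 scale (not data from the study).
model_a = [5, 5, 4, 5, 5, 4, 5, 5]
model_b = [4, 5, 4, 4, 5, 3, 4, 5]

t_stat = welch_t(model_a, model_b)
p_value = two_sided_p_normal_approx(t_stat)

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    m, sd = summarize(scores)
    print(f"{name}: {m:.3f} ± {sd:.3f}")
print(f"t = {t_stat:.3f}, approximate two-sided P = {p_value:.3f}")
```

With real rating data, an exact P value would come from the t distribution with the appropriate degrees of freedom rather than from the normal approximation used here.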

Results: Overall, both LLMs performed well, with mean scores >4 for the 5 key characteristics. Gemini demonstrated superior performance in overall clarity (4.848 ± 0.36 vs 4.743 ± 0.481, P = .034), but all other characteristics demonstrated nonsignificant differences (P > .05). Gemini also demonstrated superior clarity in the surgical timing and technique (P = .038) and the rehabilitation and prevention (P = .044) subcategories. Additionally, Gemini had superior completeness scores in the rehabilitation and prevention subcategory (P = .044), but no statistically significant differences were found among the other subcategories. The overall IRR was 0.71 (moderate).

Conclusions: Both Gemini and ChatGPT-4 demonstrated good overall ability to generate accurate and relevant responses to question prompts based on the 2022 AAOS CPG for ACLR. However, Gemini demonstrated superior clarity in multiple domains, as well as superior completeness for questions pertaining to rehabilitation and prevention.

Clinical relevance: This study addresses a gap in the LLM and ACLR literature by comparing the performance of ChatGPT-4 with that of Gemini, which is growing in popularity, with more than 300 million individual uses in May 2024 alone. Moreover, the results demonstrated superior performance of Gemini in both clarity and completeness, which are critical elements of a tool used by patients for educational purposes. Additionally, the study uses question prompts based on the AAOS CPG, which may serve as a method of standardization for future investigations of LLM platform performance. Thus, the results of this study may be of interest to both the readership of Arthroscopy and patients.