ChatGPT for Addressing Patient-centered Frequently Asked Questions in Glaucoma Clinical Practice

Ophthalmol Glaucoma. 2024 Oct 16:S2589-4196(24)00183-2. doi: 10.1016/j.ogla.2024.10.005. Online ahead of print.

Abstract

Purpose: Large language models such as ChatGPT-3.5 are often used by the public to answer questions related to daily life, including health advice. This study evaluated the responses of ChatGPT-3.5 in answering patient-centered frequently asked questions (FAQs) relevant in glaucoma clinical practice.

Design: Prospective cross-sectional survey.

Participants: Expert graders.

Methods: Twelve experts across a range of clinical, education, and research practices in optometry and ophthalmology served as graders. Over 200 patient-centric FAQs from authoritative professional society, hospital, and advocacy websites were distilled and filtered into 40 questions across 4 themes: definition and risk factors, diagnosis and testing, lifestyle and other accompanying conditions, and treatment and follow-up. The questions were individually input into ChatGPT-3.5 to generate responses. Each of the 12 experts graded the responses independently.

Main outcome measures: A 5-point Likert scale (1 = strongly disagree; 5 = strongly agree) was used to grade ChatGPT-3.5 responses across 4 domains: coherency, factuality, comprehensiveness, and safety.

Results: Across all themes and domains, the median score was 4 ("agree"). Comprehensiveness had the lowest scores of the 4 domains (mean 3.7 ± 0.9), followed by factuality (mean 3.9 ± 0.9), then coherency and safety (mean 4.1 ± 0.8 for both). Examination of the 40 individual questions showed that 8 (20%), 17 (42.5%), 24 (60%), and 8 (20%) of the questions had average scores below 4 (i.e., below "agree") for the coherency, factuality, comprehensiveness, and safety domains, respectively. Free-text comments by the experts highlighted factual omissions and gaps in comprehensiveness (e.g., secondary glaucoma) and noted that some responses were vague (i.e., they did not account for individual patient circumstances).
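The per-question tallies reported in the Results can be reproduced with simple arithmetic. The following is a minimal Python sketch, not the authors' analysis code: the counts and the total of 40 questions come from the abstract, while the function names and the toy grader scores are hypothetical and used only for illustration.

```python
from statistics import mean

# Counts reported in the abstract: questions (out of 40) whose
# average score across graders fell below 4 ("agree"), per domain.
REPORTED_BELOW_4 = {
    "coherency": 8,
    "factuality": 17,
    "comprehensiveness": 24,
    "safety": 8,
}
TOTAL_QUESTIONS = 40


def percent_below(count, total=TOTAL_QUESTIONS):
    """Convert a count of questions into a percentage of all questions."""
    return 100.0 * count / total


def count_below_threshold(scores_by_question, threshold=4.0):
    """Count questions whose mean grader score is strictly below the threshold.

    scores_by_question: one inner list of Likert scores (1-5) per question.
    """
    return sum(1 for scores in scores_by_question if mean(scores) < threshold)


# Percentages implied by the reported counts.
percentages = {domain: percent_below(n) for domain, n in REPORTED_BELOW_4.items()}
print(percentages)

# Hypothetical toy data (NOT from the study): 4 questions, 3 graders each.
toy_scores = [[4, 5, 4], [3, 4, 4], [4, 4, 4], [2, 3, 5]]
print(count_below_threshold(toy_scores))
```

Under these assumptions, the percentages match those stated in the abstract (20%, 42.5%, 60%, and 20%), and a question with a mean of exactly 4 is not counted as below "agree".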

Conclusions: ChatGPT-3.5 responses to FAQs in glaucoma were generally agreeable in terms of coherency, factuality, comprehensiveness, and safety. However, areas of weakness were identified, precluding recommendations for routine use to provide patients with tailored counseling in glaucoma, especially with respect to development of glaucoma and its management.

Financial disclosure(s): Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

Keywords: Artificial intelligence; Chatbot; Collaborative care; Conversation agents; Large language models.