Quality of Chatbot Information Related to Benign Prostatic Hyperplasia

Christopher J Warren; Nicolette G Payne; Victoria S Edmonds; Sandeep S Voleti; Mouneeb M Choudry; Nahid Punjani; Haider M Abdul-Muhsin; Mitchell R Humphreys

doi:10.1002/pros.24814

Prostate. 2025 Feb;85(2):175-180. doi: 10.1002/pros.24814. Epub 2024 Nov 8.

Authors

Christopher J Warren¹, Nicolette G Payne¹, Victoria S Edmonds¹, Sandeep S Voleti¹, Mouneeb M Choudry¹, Nahid Punjani¹, Haider M Abdul-Muhsin¹, Mitchell R Humphreys¹

Affiliation

¹ Department of Urology, Mayo Clinic Arizona, Phoenix, Arizona, USA.

PMID: 39513562

Abstract

Background: Large language model (LLM) chatbots, a form of artificial intelligence (AI) that excels at prompt-based interactions and mimics human conversation, have emerged as a tool for providing patients with information about urologic conditions. We aimed to examine the quality of information about benign prostatic hyperplasia surgery provided by four chatbots and to assess how they respond to sample patient messages.

Methods: We identified the top three queries in Google Trends related to "treatment for enlarged prostate." These were entered into ChatGPT (OpenAI), Bard (Google), Bing AI (Microsoft), and Doximity GPT (Doximity), both unprompted and prompted for specific criteria (optimized). The chatbot-provided answers to each query were evaluated for overall quality by three urologists using the DISCERN instrument. Readability was measured with the built-in Flesch-Kincaid reading level tool in Microsoft Word. To assess the ability of chatbots to answer patient questions, we prompted the chatbots with a clinical scenario related to holmium laser enucleation of the prostate, followed by 10 questions that the National Institutes of Health recommends patients ask before surgery. Accuracy and completeness of responses were graded with Likert scales.
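
The Flesch-Kincaid grade level used for the readability measure follows a standard published formula: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The Python sketch below illustrates that formula only. The sentence splitter and the vowel-group syllable counter are simplifying assumptions and are not the algorithm used by Microsoft Word, so scores may differ from those reported in this study. The function names are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (heuristic, an assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in endings such as "-le" or "-ee".
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Return the Flesch-Kincaid grade level of `text`."""
    # Naive sentence split on terminal punctuation (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    simple = "The cat sat on the mat."
    clinical = (
        "Holmium laser enucleation of the prostate is a minimally "
        "invasive surgical procedure."
    )
    print(round(flesch_kincaid_grade(simple), 2))
    print(round(flesch_kincaid_grade(clinical), 2))
```

Short sentences built from one-syllable words score low, while long sentences dense with clinical terms score high, which is consistent with chatbot answers on surgical topics landing well above a grade 8 reading level.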

Results: Without prompting, the quality of information was moderate across all chatbots but improved significantly with prompting (mean [SD], 3.3 [1.2] vs. 4.4 [0.7] out of 5; p < 0.001). When answering simulated patient messages, the chatbots were accurate (mean [SD], 5.6 [0.4] out of 6) and complete (mean [SD], 2.8 [0.3] out of 3). Additionally, 98% (39/40) of responses had a median accuracy score of 5 or higher, which corresponds to "nearly all correct." Readability was poor, with a mean (SD) Flesch-Kincaid grade level of 12.1 (1.3) for unprompted responses.

Conclusions: LLM chatbots hold promise for patient education, but their effectiveness is limited by the need for careful prompting from the user and by responses written at a reading level above that of most Americans (grade 8). Educating patients and physicians on optimal LLM interaction is crucial to unlock the full potential of chatbots.

Keywords: ChatGPT; DISCERN; HoLEP; artificial intelligence; large language model; patient education.

MeSH terms

Artificial Intelligence
Humans
Internet
Male
Patient Education as Topic / methods
Patient Education as Topic / standards
Prostatic Hyperplasia* / surgery