Clinical Research With Large Language Models Generated Writing-Clinical Research with AI-assisted Writing (CRAW) Study

Ivan A Huespe; Jorge Echeverri; Aisha Khalid; Indalecio Carboni Bisso; Carlos G Musso; Salim Surani; Vikas Bansal; Rahul Kashyap

doi:10.1097/CCE.0000000000000975

Crit Care Explor. 2023 Oct 2;5(10):e0975. doi: 10.1097/CCE.0000000000000975. eCollection 2023 Oct.

Authors

Ivan A Huespe¹,², Jorge Echeverri³, Aisha Khalid⁴, Indalecio Carboni Bisso¹, Carlos G Musso¹,⁵, Salim Surani⁶,⁷, Vikas Bansal⁶, Rahul Kashyap⁶,⁸

Affiliations

¹ Hospital Italiano de Buenos Aires, Buenos Aires, Argentina.
² Universidad de Buenos Aires, Buenos Aires, Argentina.
³ Universidad Javeriana, Bogotá, Colombia.
⁴ Harvard Medical School, Boston, MA.
⁵ Facultad de Ciencias de la Salud, Universidad Simon Bolivar, Barranquilla, Colombia.
⁶ Mayo Clinic, Rochester, MN.
⁷ Texas A&M University, College Station, TX.
⁸ WellSpan Health, York, PA.

Abstract

Importance: The scientific community debates Generative Pre-trained Transformer (GPT)-3.5's article quality, authorship merit, originality, and ethical use in scientific writing.

Objectives: Assess GPT-3.5's ability to craft the background section of critical care clinical research questions compared to medical researchers with H-indices of 22 and 13.

Design: Observational cross-sectional study.

Setting: Researchers from 20 countries across six continents evaluated the background sections.

Participants: Researchers with a Scopus index greater than 1 were included.

Main outcomes and measures: We generated a background section for a critical care clinical research question on "acute kidney injury in sepsis" using three methods: a researcher with an H-index greater than 20, a researcher with an H-index greater than 10, and GPT-3.5. The three background sections were presented in a blinded survey to researchers with H-indices ranging from 1 to 96. First, the researchers rated the main components of each background on a 5-point Likert scale. Second, they were asked to identify which backgrounds were written by humans alone and which were written with large language model tools.

Results: A total of 80 researchers completed the survey. The median H-index was 3 (interquartile range, 1-7.25), and the most common specialty was Critical Care (36%). Compared with researchers with an H-index of 22 and 13, GPT-3.5 was rated higher on the Likert scale for the main background components (median 4.5 vs. 3.82 vs. 3.6 vs. 4.5, respectively; p < 0.001). The sensitivity and specificity for distinguishing researcher writing from GPT-3.5 writing were poor, 22.4% and 57.6%, respectively.
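
For readers less familiar with the two detection metrics reported above, the following is a minimal sketch of how sensitivity and specificity are computed from classification counts. The abstract does not state which class was treated as "positive" or give the underlying counts, so the function name, the labeling convention, and the numbers below are all hypothetical and are not the study's data.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) as fractions between 0 and 1.

    tp: positive-class texts correctly identified
    fn: positive-class texts missed
    tn: negative-class texts correctly identified
    fp: negative-class texts wrongly flagged as positive
    Which class counts as "positive" is an assumption left to the caller.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each class needs at least one rated text")
    sensitivity = tp / (tp + fn)  # share of positives that were detected
    specificity = tn / (tn + fp)  # share of negatives that were recognized
    return sensitivity, specificity


# Hypothetical counts for illustration only (not taken from the study).
sens, spec = sensitivity_specificity(tp=20, fn=60, tn=96, fp=64)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")
```

Values near 50% on both metrics would correspond to raters doing little better than guessing, which is the sense in which the reported figures are described as poor.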

Conclusions and relevance: GPT-3.5 could create background research content indistinguishable from the writing of a medical researcher. It was rated higher than medical researchers with an H-index of 22 and 13 in writing the background section of a critical care clinical research question.

Keywords: Generative Pre-trained Transformer-3.5; article writing; artificial intelligence; clinical research; medical research.