Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4)

Daniel Truhn; Chiara Ml Loeffler; Gustav Müller-Franzes; Sven Nebelung; Katherine J Hewitt; Sebastian Brandner; Keno K Bressem; Sebastian Foersch; Jakob Nikolas Kather

doi:10.1002/path.6232

J Pathol. 2024 Mar;262(3):310-319. doi: 10.1002/path.6232. Epub 2023 Dec 14.

Authors

Daniel Truhn¹, Chiara Ml Loeffler²,³,⁴, Gustav Müller-Franzes¹, Sven Nebelung¹, Katherine J Hewitt²,⁴, Sebastian Brandner⁵, Keno K Bressem⁶, Sebastian Foersch⁷, Jakob Nikolas Kather²,³,⁸,⁹

Affiliations

¹ Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany.
² Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany.
³ Department of Medicine I, University Hospital Dresden, Dresden, Germany.
⁴ Department of Medicine III, University Hospital RWTH Aachen, Aachen, Germany.
⁵ Department of Neurosurgery, University Hospital Erlangen, Erlangen, Germany.
⁶ Department of Radiology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
⁷ Institute of Pathology, University Medical Center Mainz, Mainz, Germany.
⁸ Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany.
⁹ Pathology and Data Analytics, Leeds Institute of Medical Research at St James's, University of Leeds, Leeds, UK.

PMID: 38098169
DOI: 10.1002/path.6232

Abstract

Deep learning applied to whole-slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time-consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre-trained transformer 4 (GPT-4), can extract structured data from unstructured plain language reports using a zero-shot approach without requiring any re-training. We tested this hypothesis by utilising GPT-4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM-generated structured data and human-generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future.

© 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
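The workflow summarised in the abstract (zero-shot extraction followed by a comparison against human-generated structured data) can be illustrated with a minimal Python sketch. This is not the authors' code: the field names, the prompt wording, the JSON answer format and the exact-match concordance measure are all assumptions made for illustration, and no model is actually called. The example "responses" are invented strings, not study data.

```python
import json

# Hypothetical field names; the abstract does not list the extracted items.
FIELDS = ["t_stage", "n_stage", "lymph_nodes_examined"]


def build_zero_shot_prompt(report_text, fields=FIELDS):
    """Build an instruction-only prompt (no worked examples, hence zero-shot)."""
    field_list = ", ".join(fields)
    return (
        "Extract the following items from the pathology report below and "
        f"answer with one JSON object using exactly these keys: {field_list}. "
        "Use null for any item the report does not state.\n\n"
        f"Report:\n{report_text}"
    )


def parse_response(raw, fields=FIELDS):
    """Parse a JSON answer; keys the model omitted become None."""
    data = json.loads(raw)
    return {f: data.get(f) for f in fields}


def concordance(llm_rows, human_rows, fields=FIELDS):
    """Per-field fraction of reports where both sources agree exactly."""
    if not llm_rows or len(llm_rows) != len(human_rows):
        raise ValueError("need two non-empty lists of equal length")
    n = len(llm_rows)
    return {
        f: sum(l[f] == h[f] for l, h in zip(llm_rows, human_rows)) / n
        for f in fields
    }


# Invented model answers standing in for real API output (assumption).
raw_answers = [
    '{"t_stage": "pT3", "n_stage": "pN1", "lymph_nodes_examined": 14}',
    '{"t_stage": "pT2", "n_stage": "pN0"}',
]
llm_rows = [parse_response(r) for r in raw_answers]

# Invented human annotations for the same two reports (assumption).
human_rows = [
    {"t_stage": "pT3", "n_stage": "pN1", "lymph_nodes_examined": 14},
    {"t_stage": "pT2", "n_stage": "pN1", "lymph_nodes_examined": None},
]

scores = concordance(llm_rows, human_rows)
print(scores)
```

In a real pipeline the string returned by `build_zero_shot_prompt` would be sent to the model and the reply passed to `parse_response`. Exact-match agreement is only one possible way to quantify concordance, and the abstract does not say which measure the study used.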

Keywords: artificial intelligence; large language models; named entity recognition; natural language processing; pathology report; text mining.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Glioblastoma*
Humans
Machine Learning
Precision Medicine*
United Kingdom
