Assessing the feasibility and external validity of natural language processing-extracted data for advanced lung cancer patients

Lung Cancer. 2025 Jan 4:199:108080. doi: 10.1016/j.lungcan.2025.108080. Online ahead of print.

Abstract

Background: Manual extraction of real-world clinical data for research can be time-consuming and prone to error. We assessed the feasibility of using natural language processing (NLP), an AI technique, to automate data extraction for patients with advanced lung cancer (aLC). We also assessed the external validity of our NLP-extracted data by comparing our findings to those reported in the literature.

Methods: Patients diagnosed with stage IIIB or IV lung cancer between January 2015 and December 2017 at Princess Margaret Cancer Centre who received at least one dose of systemic therapy were included. Their electronic health records were provided to Pentavere's NLP platform, DARWEN™, in March 2019. Descriptive statistics summarized baseline patient and cancer characteristics, molecular biomarkers, and first-line systemic therapies. Cox multivariate models were used to evaluate prognostic factors for the advanced non-small cell lung cancer (NSCLC) and small-cell lung cancer (SCLC) cohorts.

Results: NLP extracted clinical information (n = 333 patients) in a total of 8 hours, with few missing values for smoking status (n = 2) and Eastern Cooperative Oncology Group (ECOG) status (n = 5). Baseline patient and cancer characteristics summarized from NLP-extracted data were comparable to those in previous studies and population reports. For NSCLC patients, being male (HR 1.44, 95% CI [1.04, 2.00]), having worse ECOG (1.48 [1.22, 1.81]), and having liver (2.24 [1.45, 3.46]), bone (2.09 [1.48, 2.96]), or lung metastases (2.54 [1.05, 2.26]) were associated with worse survival outcomes. For SCLC patients, having older age (HR 1.70 per 10 years, 95% CI [1.10, 2.63]) and liver metastases (3.81 [1.61, 9.01]) were associated with worse survival outcomes.
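As a reading aid for the hazard ratios above, the following minimal sketch shows how a hazard ratio and its 95% CI relate to a Cox model coefficient. It assumes Wald intervals computed on the log-hazard scale, which is a common convention but is not stated in the abstract. The function names are hypothetical and are not part of the study's analysis code.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def hazard_ratio_ci(beta, se, z=Z_95):
    """Return (HR, lower, upper) from a Cox log-hazard coefficient and its standard error.

    Assumes a Wald interval on the log scale: exp(beta +/- z * se).
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def coefficient_from_ci(lower, upper, z=Z_95):
    """Back-calculate (beta, se) from a reported Wald CI for a hazard ratio.

    Under the Wald assumption, the CI is symmetric around beta on the log scale.
    """
    log_lower, log_upper = math.log(lower), math.log(upper)
    beta = (log_lower + log_upper) / 2
    se = (log_upper - log_lower) / (2 * z)
    return beta, se

# Illustration using the reported interval for male sex in the NSCLC cohort.
beta, se = coefficient_from_ci(1.04, 2.00)
hr, lower, upper = hazard_ratio_ci(beta, se)
print(f"HR {hr:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Under this assumption, the point estimate is the geometric mean of the interval bounds, so the interval [1.04, 2.00] implies a hazard ratio of about 1.44, matching the reported value for male sex.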

Conclusion: Our study demonstrated that automated data extraction using NLP is feasible and time-efficient. Additionally, the NLP-extracted data can be used to identify valid and useful clinical endpoints for research. NLP holds significant potential to accelerate the extraction of real-world data for future observational studies.

Keywords: Electronic health record; Natural language processing; Observational studies.