Purpose: This study aimed to investigate the impact of different model retraining schemes and data partitioning on model performance in the task of COVID-19 classification on standard chest radiographs (CXRs), in the context of model generalizability.
Approach: Two datasets from the same institution were used: Set A (9860 patients, collected from 02/20/2020 to 02/03/2021) and Set B (5893 patients, collected from 03/15/2020 to 01/01/2022). An original deep learning (DL) model trained and tested for the task of COVID-19 classification on the initial partition of Set A achieved an area under the curve (AUC) value of 0.76, whereas Set B yielded a significantly lower value of 0.67. To explore this discrepancy, four separate strategies were applied to the original model: (1) retraining on Set B, (2) fine-tuning on Set B, (3) adding regularization, and (4) repartitioning Set A into training and test sets 200 times and reporting the resulting AUC values.
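As a rough illustration of strategy (4), the sketch below repeatedly repartitions a dataset at the patient level and records the test AUC of a model retrained on each partition. It is a minimal sketch, not the authors' pipeline: the function name, the use of scikit-learn's GroupShuffleSplit, and the logistic-regression stand-in for the DL model are all assumptions for illustration.

```python
# Hypothetical sketch of repeated patient-level repartitioning (strategy 4).
# X: feature matrix, y: binary labels, patient_ids: one ID per case.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit

def repartition_aucs(X, y, patient_ids, n_repartitions=200, seed=0):
    aucs = []
    splitter = GroupShuffleSplit(n_splits=n_repartitions, test_size=0.2,
                                 random_state=seed)
    # Each split keeps all cases from a given patient on one side only,
    # so training and test sets never share patients.
    for train_idx, test_idx in splitter.split(X, y, groups=patient_ids):
        model = LogisticRegression(max_iter=1000)  # stand-in for the DL model
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return np.array(aucs)
```

The spread of the returned AUC values across repartitions is what exposes how much apparent performance depends on the particular train/test split.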
Results: The model achieved the following AUC values (95% confidence intervals) for the four methods: (1) 0.61 [0.56, 0.66] and (2) 0.70 [0.66, 0.73], both on Set B; (3) 0.76 [0.72, 0.79] on the initial test partition of Set A and 0.68 [0.66, 0.70] on Set B; and (4) a range of values across the 200 repartitions of Set A. The lowest AUC value among the Set A repartitions (0.66 [0.62, 0.69]) was no longer significantly different from the 0.67 initially achieved on Set B.
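For context, confidence intervals like those reported above are commonly obtained by bootstrapping the test cases. The sketch below shows one such case-level bootstrap for an AUC; the function name, resampling scheme, and parameters are assumptions and may differ from the interval estimation the study actually used.

```python
# Hypothetical case-level bootstrap 95% CI for a test-set AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    n = len(y_true)
    boot_aucs = []
    while len(boot_aucs) < n_boot:
        idx = rng.integers(0, n, n)           # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:   # both classes needed for an AUC
            continue
        boot_aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boot_aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)
```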
Conclusions: Different repartitions of the same dataset used to train a DL model yielded significantly different performance values, which helped explain the discrepancy between Set A and Set B and further demonstrated the limitations of model generalizability.
Keywords: COVID-19; chest radiography; data partitioning; deep learning; model generalizability.