Background: Prognostic models are often used to assess the quality of healthcare. Several scores have been developed to predict mortality after cardiac surgery, but none has reached optimal performance in subsequent validations. We validate the most widely used scores (EUROSCORE I and II, STS, and ACEF) on a cohort of cardiac-surgery patients, assessing their robustness against case-mix changes.
Methods: The scores were validated on 14,559 patients admitted to 16 Italian cardiosurgical ICUs participating in the Margherita-Prosafe project in 2014 and 2015. Calibration was assessed through the Hosmer-Lemeshow test, the standardized mortality ratio, and the GiViTI calibration test and belt. Discrimination was measured by the area under the ROC curve (AUC).
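As an illustration of the measures named in the Methods, the following is a minimal sketch, not the study's own analysis code. It assumes a binary outcome vector (1 = died) and a vector of predicted risks strictly between 0 and 1; the function names and the choice of ten risk-ordered groups are hypothetical. The GiViTI calibration test and belt are not reproduced here.

```python
from typing import Sequence


def standardized_mortality_ratio(died: Sequence[int], predicted: Sequence[float]) -> float:
    """Observed deaths divided by expected deaths (the sum of predicted risks)."""
    return sum(died) / sum(predicted)


def auc(died: Sequence[int], predicted: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen non-survivor has a higher predicted risk than a randomly chosen
    survivor (ties count one half). Quadratic in sample size; fine for a sketch."""
    pos = [p for d, p in zip(died, predicted) if d == 1]
    neg = [p for d, p in zip(died, predicted) if d == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow_statistic(died: Sequence[int], predicted: Sequence[float],
                              groups: int = 10) -> float:
    """Hosmer-Lemeshow chi-square statistic over risk-ordered groups of
    near-equal size. Assumes every predicted risk lies strictly in (0, 1)."""
    pairs = sorted(zip(predicted, died))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # expected deaths in this group
        observed = sum(d for _, d in chunk)   # observed deaths in this group
        # Contributions from deaths and from survivors
        stat += (observed - expected) ** 2 / expected
        stat += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return stat
```

A standardized mortality ratio above 1 indicates more deaths than the score predicted (underestimation), and a ratio below 1 indicates overestimation. The Hosmer-Lemeshow statistic is then compared against a chi-square distribution to obtain a p-value; that step is omitted here.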
Results: Of the patients studied, 10,317 were eligible for calculation of the STS Score (4,156 isolated valve, 4,681 isolated CABG, and 1,480 single valve and CABG), which calibrated well in these subgroups. The ACEF Score was available for 14,139 patients, and EUROSCORE I and II for 14,071. EUROSCORE I significantly overestimated mortality; EUROSCORE II calibrated well overall but underestimated mortality in patients undergoing complex or non-elective surgery. The ACEF Score calibrated poorly in both elective and non-elective patients. Discrimination was acceptable (AUC>0.70) for all models except the ACEF Score.
Conclusions: Cardiac surgery scores calibrate poorly when the case-mix of the validation sample differs from that of the development sample. To ensure reliability for benchmarking, they should be validated in the clinical settings in which they are applied and updated periodically. Advanced statistical tools are essential for the correct interpretation and application of severity scores.