Objective: To develop and validate machine learning models for predicting COVID-19 related hospitalization as early as CDC contact tracing using integrated CDC contact tracing and South Carolina medical claims data.
Methods: Using the dataset (n=82,073, 1/1/2018 - 3/1/2020), we identified 3,305 patients with COVID-19 and were captured by contact tracing. We developed and validated machine learning models (i.e., support vector machine, random forest, XGboost), followed by multi-level validations and pilot statewide implementation.
Results: Using 10-cross validation, random forest outperformed other models (F1=0.872 for general hospitalization and 0.763 for COVID-19 related hospitalization), followed by XGBoost (F1=0.845 and 0.682) and support vector machine (F1=0.845 and 0.644). We identified new self-reported symptoms from contact tracing (e.g., fatigue, congestion, headache, loss of taste) that are highly predictive of hospitalization.
Conclusions: Our study demonstrated the feasibility of identifying individuals at risk of hospitalization at the time of contact tracing for early intervention and prevention.
Policy implications: Our findings demonstrate existing promise for leveraging CDC contact tracing for establishing a cost-effective statewide surveillance and generalizability for nationwide adoption for enhancing pandemic preparedness in the US.
Keywords: COVID-19; contact tracing; machine learning; medical claims; medical record linkage.