Racial and Ethnic Disparities in Predictive Accuracy of Machine Learning Algorithms Developed Using a National Database for 30-Day Complications Following Total Joint Arthroplasty

Christian A Pean; Anirudh Buddhiraju; Tony Lin-Wei Chen; Henry Hojoon Seo; Michelle R Shimizu; John G Esposito; Young-Min Kwon

doi:10.1016/j.arth.2024.10.060


J Arthroplasty. 2024 Oct 20:S0883-5403(24)01073-8. doi: 10.1016/j.arth.2024.10.060. Online ahead of print.

Authors

Christian A Pean¹, Anirudh Buddhiraju¹, Tony Lin-Wei Chen¹, Henry Hojoon Seo¹, Michelle R Shimizu¹, John G Esposito¹, Young-Min Kwon¹

Affiliation

¹ Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts.

PMID: 39433263

Abstract

Background: While the predictive capabilities of machine learning (ML) algorithms for hip and knee total joint arthroplasty (TJA) have been demonstrated in previous studies, their performance in racial and ethnic minority patients has not been investigated. This study aimed to assess the performance of ML algorithms in predicting 30-day complications following TJA in racial and ethnic minority patients.

Methods: A total of 267,194 patients undergoing primary TJA between 2013 and 2020 were identified from a national outcomes database. The patient cohort was stratified according to race, with further substratification into Hispanic or non-Hispanic ethnicity. Two ML algorithms, histogram-based gradient boosting (HGB) and random forest (RF), were trained to predict 30-day complications following primary TJA in the overall population. The models were subsequently assessed in each racial and ethnic subcohort for discrimination, calibration, accuracy, and potential clinical usefulness.
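The subcohort assessment described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the subgroup labels "A" and "B", and the toy predictions are all hypothetical, and it assumes each patient contributes a subgroup label, an observed binary outcome, and a predicted probability from a model trained on the overall population. Discrimination is computed as the AUC (the probability that a randomly chosen patient with a complication receives a higher predicted risk than one without), and accuracy as the Brier score.

```python
from collections import defaultdict


def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return float("nan")  # undefined when a subgroup has only one outcome class
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def evaluate_by_subgroup(rows):
    """rows: iterable of (subgroup, outcome, predicted_probability) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, y, p in rows:
        grouped[group][0].append(y)
        grouped[group][1].append(p)
    return {
        g: {"n": len(ys), "auc": auc(ys, ps), "brier": brier(ys, ps)}
        for g, (ys, ps) in grouped.items()
    }


# Toy, made-up predictions for two hypothetical subgroups (not study data).
rows = [
    ("A", 1, 0.9), ("A", 0, 0.2), ("A", 1, 0.7), ("A", 0, 0.1),
    ("B", 1, 0.4), ("B", 0, 0.6), ("B", 1, 0.8), ("B", 0, 0.3),
]
results = evaluate_by_subgroup(rows)
for group, metrics in sorted(results.items()):
    print(group, metrics)
```

The pairwise AUC loop is quadratic in subgroup size, which is fine for illustration; a rank-based implementation would be the more practical choice for cohorts of the size reported here.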

Results: Both models achieved excellent discrimination (area under the curve [AUC] > 0.8; AUC_HGB = AUC_RF = 0.86), calibration, and accuracy (HGB: slope = 1.00, intercept = -0.03, Brier score = 0.12; RF: slope = 0.97, intercept = 0.02, Brier score = 0.12) in the non-Hispanic White population (N = 224,073). Discrimination decreased in the White Hispanic (N = 10,429; AUC = 0.75 to 0.76), Black (N = 25,116; AUC = 0.77), Black Hispanic (N = 240; AUC = 0.78), Asian non-Hispanic (N = 4,809; AUC = 0.78 to 0.79), and overall (N = 267,194; AUC = 0.75 to 0.76) cohorts, although the models remained well calibrated. Model discrimination (AUC = 0.67 to 0.68) and calibration were poorest in the American-Indian cohort (N = 1,870).
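For readers unfamiliar with the calibration figures reported above, the following Python sketch shows one common way a calibration slope and intercept can be estimated: a logistic recalibration model, logit(P(y = 1)) = a + b * logit(p), fitted to observed outcomes y and predicted probabilities p. A slope b near 1 and an intercept a near 0 indicate good calibration. The abstract does not state which estimation procedure the authors used, so this is an assumption; the function names and the synthetic data are hypothetical.

```python
import math
import random


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to avoid infinities."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def calibration_slope_intercept(y_true, y_prob, iters=25):
    """Fit logit(P(y=1)) = a + b * logit(p) by Newton-Raphson; returns (slope b, intercept a)."""
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            r = yi - mu
            g0 += r                # gradient w.r.t. intercept
            g1 += r * xi           # gradient w.r.t. slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-10:
            break
    return b, a


# Synthetic data (not study data): outcomes are drawn from the predicted
# probabilities themselves, so the predictions are calibrated by construction.
rng = random.Random(0)
probs = [rng.uniform(0.02, 0.6) for _ in range(20000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]
slope, intercept = calibration_slope_intercept(outcomes, probs)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Because the synthetic outcomes are sampled from the predictions, the fitted slope should land close to 1 and the intercept close to 0; predictions that are too extreme would instead give a slope below 1.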

Conclusions: The ML algorithms demonstrated inferior predictive ability for 30-day complications following primary TJA in racial and ethnic minorities when trained on existing healthcare big data. This may be attributed to the disproportionate underrepresentation of minority groups within these databases, as demonstrated by the smaller sample sizes available to train the ML models. The ML models developed using smaller datasets (e.g., in racial and ethnic minorities) may not be as accurate as those developed using larger datasets, highlighting the need for equity-conscious model development.

Level of evidence: III; retrospective cohort study.

Keywords: artificial intelligence; big data; health disparities; health inequity; machine learning; total joint arthroplasty.