PARTITIONING AROUND MEDOIDS CLUSTERING AND RANDOM FOREST CLASSIFICATION FOR GIS-INFORMED IMPUTATION OF FLUORIDE CONCENTRATION DATA

Yu Gu; John S Preisser; Donglin Zeng; Poojan Shrestha; Molina Shah; Miguel A Simancas-Pallares; Jeannie Ginnis; Kimon Divaris

doi:10.1214/21-aoas1516

Ann Appl Stat. 2022 Mar;16(1):551-572. doi: 10.1214/21-aoas1516. Epub 2022 Mar 28.

Authors

Yu Gu¹, John S Preisser¹, Donglin Zeng¹, Poojan Shrestha²,³, Molina Shah², Miguel A Simancas-Pallares², Jeannie Ginnis², Kimon Divaris²,³

Affiliations

¹ Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill.
² Division of Pediatric and Public Health, Adams School of Dentistry, University of North Carolina at Chapel Hill.
³ Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill.

Abstract

Community water fluoridation is an important component of oral health promotion, as fluoride is a well-documented caries-preventive agent. Direct measurements of domestic water fluoride content provide valuable information regarding individuals' fluoride exposure and thus caries risk; however, they are logistically challenging to carry out at a large scale in oral health research. This article describes the development and evaluation of a novel method for the imputation of missing domestic water fluoride concentration data informed by spatial autocorrelation. The context is a state-wide epidemiologic study of pediatric oral health in North Carolina, where domestic water fluoride concentration information was missing for approximately 75% of study participants with clinical data on dental caries. A new machine-learning-based imputation method that combines partitioning around medoids clustering and random forest classification (PAMRF) is developed and implemented. Imputed values are filtered according to allowable error rates or target sample size, depending on the requirements of each application. In leave-one-out cross-validation and simulation studies, PAMRF outperforms four existing imputation approaches: two conventional spatial interpolation methods (inverse-distance weighting, IDW, and universal kriging, UK) and two supervised learning methods (k-nearest neighbors, KNN, and classification and regression trees, CART). The inclusion of multiply imputed values in the estimation of the association between fluoride concentration and dental caries prevalence resulted in essentially no change in PAMRF estimates but substantial gains in precision due to larger effective sample size. PAMRF is a powerful new method for the imputation of missing fluoride values where geographical information exists.

Keywords: clustering; missing values; multiple imputation; random forest; spatial interpolation.
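
For readers unfamiliar with the comparison methods, the following is a minimal sketch of one of the baselines named in the abstract, inverse-distance weighting, evaluated by leave-one-out cross-validation. It is not PAMRF and not the authors' implementation. The function names, the power parameter of 2, the use of planar coordinates with Euclidean distance, and the sample values are all illustrative assumptions.

```python
import math


def idw_predict(target, known, power=2.0):
    """Inverse-distance-weighted estimate at `target` from (x, y, value) triples."""
    num = 0.0
    den = 0.0
    for x, y, value in known:
        d = math.hypot(target[0] - x, target[1] - y)
        if d == 0.0:
            return value  # target coincides with a measured site
        w = 1.0 / d ** power  # closer sites receive larger weights
        num += w * value
        den += w
    return num / den


def loocv_mae(points, power=2.0):
    """Leave-one-out mean absolute error of IDW over `points`."""
    errors = []
    for i, (x, y, value) in enumerate(points):
        rest = points[:i] + points[i + 1:]  # hold out the i-th site
        errors.append(abs(idw_predict((x, y), rest, power) - value))
    return sum(errors) / len(errors)


# Hypothetical planar coordinates and fluoride concentrations (made-up values).
samples = [
    (0.0, 0.0, 0.70),
    (1.0, 0.0, 0.65),
    (0.0, 1.0, 0.72),
    (5.0, 5.0, 0.10),
    (5.0, 6.0, 0.12),
]

if __name__ == "__main__":
    print("IDW estimate at (0.5, 0.5):", idw_predict((0.5, 0.5), samples))
    print("Leave-one-out MAE:", loocv_mae(samples))
```

In a real GIS setting the coordinates would typically be projected before distances are computed, and the power parameter would be tuned rather than fixed.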
