Comparison of methods for imputing ordinal data using multivariate normal imputation: a case study of non-linear effects in a large cohort study

Stat Med. 2012 Dec 30;31(30):4164-74. doi: 10.1002/sim.5445. Epub 2012 Jul 24.

Abstract

Background: Multiple imputation is becoming increasingly popular for handling missing data, and Markov chain Monte Carlo imputation under an assumption of multivariate normality (MVN) is a commonly used approach. Imputing categorical variables (which are clearly non-normal) using MVN imputation is challenging, and several approaches have been suggested. However, it remains unclear which approach should be preferred.

Methods: We explore methods for imputing ordinal variables using MVN imputation when the aim is to estimate a non-linear association between an ordinal exposure and a binary outcome. These methods include imputing the variable as continuous or as a set of indicators, combined with various methods for assigning imputed values to the possible categories (rounding). We introduce a new approach in which we impute the variable as continuous and assign imputed values to categories based on the mean indicators imputed in a separate round of imputation. We compare these approaches in a simple setting in which we make 50% of the data in an ordinal exposure missing completely at random, within an otherwise complete real dataset.
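
The ideas in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (make_mcar, simple_round, proportion_round) are hypothetical, the MVN imputation step itself is omitted, and the abstract does not give the exact rule used to assign categories from the mean imputed indicators. Here proportion_round assumes that the target category shares (for example, mean indicators from a separate imputation) are supplied by the caller, and that categories are coded as ordered numbers.

```python
import random


def make_mcar(values, fraction=0.5, seed=0):
    """Return a copy of `values` with `fraction` of entries set to None.

    Entries are chosen uniformly at random, independent of the data,
    which is what "missing completely at random" (MCAR) means.
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def simple_round(x, categories):
    """Assign a continuous imputed value to the nearest category code."""
    return min(categories, key=lambda c: abs(c - x))


def proportion_round(imputed, categories, proportions):
    """Assign continuous imputed values to ordered categories so that the
    share of values in each category matches `proportions`.

    Assumption: `proportions` are target category shares (e.g. mean
    imputed indicators) and sum to 1. This is an illustrative rule only.
    """
    n = len(imputed)
    # Indices of the imputed values, from smallest to largest.
    order = sorted(range(n), key=lambda i: imputed[i])
    assigned = [None] * n
    cumulative = 0.0
    start = 0
    for k, (category, share) in enumerate(zip(categories, proportions)):
        cumulative += share
        # The last category takes whatever remains, so every value is assigned.
        if k == len(categories) - 1:
            end = n
        else:
            end = min(n, round(cumulative * n))
        for i in order[start:end]:
            assigned[i] = category
        start = end
    return assigned
```

In this sketch simple_round can only move each value to its closest category, whereas proportion_round ranks the imputed values and cuts them at the target shares, so the resulting marginal distribution follows the supplied proportions by construction.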

Results: Methods that imputed the ordinal exposure as continuous distorted the non-linear exposure-outcome association, biasing it towards linearity irrespective of the rounding method. In contrast, imputing using indicators preserved the non-linear association but not the marginal distribution of the ordinal variable.

Conclusions: Imputing ordinal variables as continuous can bias the estimation of the exposure-outcome association in the presence of non-linear relationships. Further work is needed to develop optimal methods for handling ordinal (and nominal) variables when using MVN imputation.

MeSH terms

  • Adult
  • Aged
  • Alcohol Drinking / adverse effects
  • Alcohol Drinking / epidemiology
  • Bias*
  • Cohort Studies*
  • Colonic Neoplasms / epidemiology
  • Colonic Neoplasms / etiology*
  • Computer Simulation
  • Data Interpretation, Statistical*
  • Humans
  • Life Style
  • Logistic Models
  • Middle Aged
  • Multivariate Analysis*
  • Nonlinear Dynamics
  • Queensland / epidemiology
  • Risk Factors