Predictive analysis methods for human microbiome data with application to Parkinson's disease

PLoS One. 2020 Aug 24;15(8):e0237779. doi: 10.1371/journal.pone.0237779. eCollection 2020.

Abstract

Microbiome data consist of operational taxonomic unit (OTU) counts characterized by zero-inflation, over-dispersion, and grouping structure among samples. Currently, statistical testing methods are commonly used to identify OTUs that are associated with a phenotype. These methods have two limitations: the validity of p-values/q-values depends sensitively on the correctness of the models, and statistical significance does not necessarily imply predictivity. Predictive analysis using methods such as LASSO is an alternative approach for identifying associated OTUs and for measuring the predictability of the phenotype variable from OTUs and other covariates. We investigate three strategies for performing predictive analysis: (1) LASSO: fitting a LASSO multinomial logistic regression model to all OTU counts with a specific transformation; (2) screening+GLM: screening OTUs with q-values returned by fitting a GLMM to each OTU, then fitting a GLM using the subset of selected OTUs; (3) screening+LASSO: fitting a LASSO to a subset of OTUs selected with GLMM. We conducted empirical studies using three simulation datasets generated from Dirichlet-multinomial models and a real gut microbiome dataset related to Parkinson's disease to investigate the performance of the three strategies for predictive analysis. Our simulation studies show that LASSO with an appropriate variable transformation performs remarkably well on zero-inflated data. Our real data analysis shows that Parkinson's disease can be predicted with high accuracy from age, sex, and selected OTUs after binary transformation (Error Rate = 0.199, AUC = 0.872, AUPRC = 0.912). These results provide strong evidence of a relationship between Parkinson's disease and the gut microbiome.
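As a rough illustration of strategy (1), the sketch below applies a binary (presence/absence) transformation to OTU counts and fits an L1-penalized logistic regression by proximal gradient descent. It is not the authors' implementation: the abstract does not name their software, the binary logistic model here is a simplification of the multinomial model for a two-class phenotype, and every function name, the penalty value `lam`, the step size, and the toy data are assumptions made only for this example.

```python
import math
import random


def binarize(counts):
    """Presence/absence transform: 1 if an OTU count is positive, else 0."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=1000):
    """Fit L1-penalized logistic regression (hypothetical helper).

    Minimizes mean log-loss + lam * sum(|w_j|) by proximal gradient
    descent. The intercept b is not penalized.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed label.
        resid = [
            sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad_b = sum(resid) / n
        grad_w = [sum(r * row[j] for r, row in zip(resid, X)) / n
                  for j in range(p)]
        b -= step * grad_b
        # Gradient step followed by soft-thresholding on each coefficient.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


if __name__ == "__main__":
    # Toy zero-inflated counts (made up for illustration): OTU 0 is present
    # more often in cases (y = 1); the other OTUs are unrelated to y.
    random.seed(0)
    y = [i % 2 for i in range(60)]
    counts = []
    for yi in y:
        p_present = 0.8 if yi == 1 else 0.2
        row = [random.randint(1, 50) if random.random() < p_present else 0]
        row += [random.randint(1, 50) if random.random() < 0.5 else 0
                for _ in range(4)]
        counts.append(row)

    X = binarize(counts)
    w, b = fit_lasso_logistic(X, y, lam=0.05)
    selected = [j for j, wj in enumerate(w) if wj != 0.0]
    print("coefficients:", [round(wj, 3) for wj in w])
    print("selected OTU indices:", selected)
```

OTUs with nonzero coefficients are the ones the penalty retains, which is how LASSO performs selection and prediction in one fit. In practice the penalty would be chosen by cross-validation and covariates such as age and sex would enter as additional columns; both are omitted here for brevity.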

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Age Factors
  • Aged
  • Aged, 80 and over
  • Bacteria / classification*
  • Bacteria / genetics
  • Bacteria / isolation & purification
  • Cohort Studies
  • Computer Simulation
  • DNA, Bacterial / isolation & purification
  • Data Interpretation, Statistical*
  • Datasets as Topic
  • Female
  • Gastrointestinal Microbiome / genetics*
  • Humans
  • Logistic Models
  • Male
  • Middle Aged
  • Models, Biological*
  • Parkinson Disease / diagnosis*
  • Parkinson Disease / microbiology
  • Predictive Value of Tests
  • Prognosis
  • RNA, Ribosomal, 16S / genetics
  • Sex Factors

Substances

  • DNA, Bacterial
  • RNA, Ribosomal, 16S

Grants and funding

W.X. was supported by the Discovery Grants of the Natural Sciences and Engineering Research Council of Canada (NSERC) (FUNDER number: RGPIN-2017-06672). L.L. was supported by the Discovery Grants of the Natural Sciences and Engineering Research Council of Canada (NSERC) (FUNDER number: RGPIN-2019-07020) and the Canada Foundation for Innovation (CFI). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.