Evaluation of marker selection methods and statistical models for chronological age prediction based on DNA methylation

Leg Med (Tokyo). 2020 Nov:47:101744. doi: 10.1016/j.legalmed.2020.101744. Epub 2020 Jul 1.

Abstract

In forensic investigation, retrieving biological information from DNA evidence is a promising area of research. One application is estimating the age of the donor from DNA methylation. Many studies have focused on age prediction using the 450K Human Methylation BeadChip, and various marker selection methods and prediction models have been considered. However, few studies have evaluated different high-dimensional variable selection methods for CpG sites in combination with various age prediction models. This study evaluates four variable selection methods (forward selection, LASSO, elastic net and SCAD), each combined with a classical statistical model and with sophisticated machine learning models, using the mean absolute deviation (MAD) and the root-mean-square error (RMSE) as performance measures. We used a publicly available 450K data set containing 991 whole blood samples (age 19-101 years). The multiple linear regression model with 16 markers chosen by forward selection performed very well in age prediction (MAD = 3.76 years and RMSE = 5.01 years). In contrast, the advanced ultrahigh-dimensional variable selection methods and sophisticated machine learning algorithms appeared unnecessary for age prediction based on DNA methylation.
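The best-performing combination named in the abstract (forward selection followed by multiple linear regression, scored by MAD and RMSE) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic stand-ins for methylation beta values, the selection criterion (largest reduction in training residual sum of squares) is an assumption, and all function names, sample sizes, and the three "informative" sites are hypothetical.

```python
import numpy as np

def mad(y_true, y_pred):
    """Mean absolute deviation between true and predicted ages."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted ages."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predict from coefficients returned by fit_ols."""
    return coef[0] + X @ coef[1:]

def forward_select(X, y, n_markers):
    """Greedily add the column that most reduces the training RSS.
    (Assumed criterion; the abstract does not state the one used.)"""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_markers):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            coef = fit_ols(X[:, cols], y)
            rss = float(np.sum((y - predict_ols(coef, X[:, cols])) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data only: 300 hypothetical samples, 50 hypothetical CpG sites.
rng = np.random.default_rng(0)
n_samples, n_sites = 300, 50
age = rng.uniform(19, 101, size=n_samples)
X = rng.uniform(0, 1, size=(n_samples, n_sites))  # beta values in [0, 1]
informative = [3, 17, 42]  # hypothetical age-associated sites
for j, slope in zip(informative, (0.004, -0.003, 0.005)):
    X[:, j] = 0.5 + slope * (age - 60) + rng.normal(0, 0.02, n_samples)

# Select markers and fit on a training split, then score on held-out samples.
train, test = slice(0, 200), slice(200, None)
markers = forward_select(X[train], age[train], n_markers=3)
coef = fit_ols(X[train][:, markers], age[train])
pred = predict_ols(coef, X[test][:, markers])
test_mad, test_rmse = mad(age[test], pred), rmse(age[test], pred)
print("selected sites:", sorted(markers))
print(f"MAD = {test_mad:.2f} years, RMSE = {test_rmse:.2f} years")
```

Scoring on held-out samples rather than on the training split keeps the error estimates from being optimistically biased by the selection step. For the same set of predictions, MAD never exceeds RMSE, which is consistent with the reported pair (3.76 and 5.01 years).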

Keywords: Age prediction; DNA methylation; Forward selection; LASSO; Machine learning; Multiple linear regression.

MeSH terms

  • Adult
  • Aged
  • Aged, 80 and over
  • Aging / genetics*
  • Algorithms*
  • CpG Islands / genetics*
  • DNA Methylation*
  • Female
  • Forecasting
  • Forensic Genetics / methods
  • Humans
  • Linear Models
  • Machine Learning*
  • Male
  • Middle Aged
  • Models, Statistical*
  • Sequence Analysis, DNA / methods*
  • Young Adult