Evaluating the detection ability of a range of epistasis detection methods on simulated data for pure and impure epistatic models

PLoS One. 2022 Feb 18;17(2):e0263390. doi: 10.1371/journal.pone.0263390. eCollection 2022.

Abstract

Background: Numerous approaches have been proposed for the detection of epistatic interactions within GWAS datasets in order to better understand the drivers of disease and genetics.

Methods: A selection of state-of-the-art approaches was assessed. These included the statistical tests fast-epistasis, BOOST, logistic regression and wtest; swarm intelligence methods, namely AntEpiSeeker, epiACO and CINOEDV; and data mining approaches, including MDR, GSS, SNPRuler and MPI3SNP. Data were simulated to provide randomly generated models with no individual main effects at different heritabilities (pure epistasis) as well as models based on penetrance tables with some main effects (impure epistasis). Detection of both two-locus and three-locus interactions was assessed across a total of 1,560 simulated datasets. The different methods were also applied to a section of the UK Biobank cohort for atrial fibrillation.
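
The distinction between pure and impure epistasis can be illustrated with a minimal sketch. It is not the simulation procedure used in the study. It assumes a two-locus penetrance table indexed by minor-allele count (0, 1, 2), Hardy-Weinberg genotype frequencies, and independent loci; the function names, example tables, and tolerance are hypothetical. A model is treated as pure when every single-locus marginal penetrance is the same, so that neither locus shows a main effect on its own.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def marginal_penetrances(table, maf_a, maf_b):
    """Return per-genotype marginal penetrances for locus A (rows) and locus B (columns)."""
    freq_a, freq_b = genotype_freqs(maf_a), genotype_freqs(maf_b)
    # Average each row over the genotype distribution of locus B, and vice versa.
    rows = [sum(table[i][j] * freq_b[j] for j in range(3)) for i in range(3)]
    cols = [sum(table[i][j] * freq_a[i] for i in range(3)) for j in range(3)]
    return rows, cols


def is_pure_epistasis(table, maf_a, maf_b, tol=1e-9):
    """True when no single-locus genotype changes disease risk (no main effects)."""
    rows, cols = marginal_penetrances(table, maf_a, maf_b)
    margins = rows + cols
    return max(margins) - min(margins) <= tol


# Hypothetical example tables (assumed values, not taken from the study).
PURE = [[0.0, 0.1, 0.0],
        [0.1, 0.0, 0.1],
        [0.0, 0.1, 0.0]]
IMPURE = [[0.0, 0.1, 0.0],
          [0.1, 0.0, 0.1],
          [0.2, 0.1, 0.2]]  # raised last row gives locus A a main effect

print(is_pure_epistasis(PURE, 0.5, 0.5))
print(is_pure_epistasis(IMPURE, 0.5, 0.5))
```

With both minor allele frequencies set to 0.5, every marginal penetrance of the first table equals 0.05, so only the joint genotype carries signal; the second table differs only in its last row, which is enough to introduce a main effect at one locus.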

Results: For pure two-locus interactions, PLINK's implementation of BOOST recovered the highest number of correct interactions (53.9%) and performed significantly better than the other methods (p = 4.52e-36). For impure two-locus interactions, MDR exhibited the best performance, recovering 62.2% of the most significant impure epistatic interactions (p = 6.31e-90 for all but one test). The assessment of three-locus interaction prediction revealed that wtest recovered the highest number (17.2%) of pure epistatic interactions (p = 8.49e-14). wtest also recovered the highest number of three-locus impure epistatic interactions (p = 6.76e-48), while AntEpiSeeker ranked the highest number of such interactions (40.5%) as the most significant. Finally, the methods were applied to a real dataset for atrial fibrillation, most notably finding an interaction between SYNE2 and DTNB.
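
The results distinguish between interactions that a method recovered and interactions that it ranked as the most significant. The abstract does not define these criteria, so the sketch below is only one plausible reading: an interaction counts as recovered if the simulated (true) SNP set appears anywhere in a method's top `top_k` reported sets, and as top-ranked if it is the first reported set. The function name, the `top_k` cut-off, and the toy data are all hypothetical.

```python
def recovery_rates(results, top_k=10):
    """Compute (recovered, top_ranked) fractions over simulated datasets.

    results: list of (true_snps, ranked_predictions) pairs, one per dataset,
    where ranked_predictions is ordered from most to least significant.
    """
    if not results:
        raise ValueError("results must contain at least one dataset")
    recovered = top_ranked = 0
    for true_snps, ranked in results:
        truth = frozenset(true_snps)  # SNP order within a set does not matter
        candidates = [frozenset(p) for p in ranked[:top_k]]
        if truth in candidates:
            recovered += 1
        if candidates and candidates[0] == truth:
            top_ranked += 1
    n = len(results)
    return recovered / n, top_ranked / n


# Toy data (assumed): four simulated datasets and one method's ranked output.
toy = [
    (("snp1", "snp2"), [("snp2", "snp1"), ("snp3", "snp4")]),  # recovered, top-ranked
    (("snp1", "snp2"), [("snp3", "snp4"), ("snp1", "snp2")]),  # recovered only
    (("snp1", "snp2"), [("snp5", "snp6")]),                    # missed
    (("snp1", "snp2"), []),                                    # nothing reported
]
recovered, top_ranked = recovery_rates(toy)
print(recovered, top_ranked)
```

Under this reading, a method can recover many interactions without ranking them first, which would be consistent with different methods leading on the two measures.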

Publication types

  • Evaluation Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Alleles
  • Atrial Fibrillation / genetics*
  • Data Mining / methods
  • Dystrophin-Associated Proteins / genetics
  • Epistasis, Genetic*
  • Gene Frequency
  • Genetic Loci*
  • Genome-Wide Association Study / methods
  • Genotype
  • Humans
  • Linear Models
  • Microfilament Proteins / genetics
  • Models, Genetic*
  • Multifactor Dimensionality Reduction
  • Nerve Tissue Proteins / genetics
  • Neuropeptides / genetics
  • Penetrance*
  • Polymorphism, Single Nucleotide
  • ROC Curve

Substances

  • DTNB protein, human
  • Dystrophin-Associated Proteins
  • Microfilament Proteins
  • Nerve Tissue Proteins
  • Neuropeptides
  • SYNE2 protein, human

Grants and funding

The authors acknowledge support from the NIHR Birmingham ECMC, NIHR Birmingham SRMRC, Nanocommons H2020-EU (731032), the NIHR Birmingham Biomedical Research Centre and the MRC Health Data Research UK (HDRUK/CFC/01), an initiative funded by UK Research and Innovation, Department of Health and Social Care (England) and the devolved administrations, and leading medical research charities. The views expressed in this publication are those of the authors and not necessarily those of the NHS, the National Institute for Health Research, the Medical Research Council or the Department of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.