Comparing mixed models and Random Forest association tests using naturalgwas and a Striped Bass SNP data set

Nathalie M LeBlanc; Scott A Pavey

doi:10.1111/1755-0998.13701

Mol Ecol Resour. 2023 Jan;23(1):145-158. doi: 10.1111/1755-0998.13701. Epub 2022 Aug 29.

Authors

Nathalie M LeBlanc¹, Scott A Pavey¹

Affiliation

¹ Department of Biological Sciences, Canadian Rivers Institute, University of New Brunswick, Saint John, New Brunswick, Canada.

PMID: 35980658
DOI: 10.1111/1755-0998.13701

Abstract

In this study, we used the phenotype simulation package naturalgwas to compare the performance of Zhao's Random Forest method with that of an uncorrected Random Forest test, latent factor mixed models (LFMM), genome-wide efficient mixed models (GEMMA), and confounder adjusted linear regression (CATE). We created 400 sets of phenotypes, corresponding to five effect sizes and two, five, 15, or 30 causal loci, simulated from two empirical data sets containing SNPs from Striped Bass representing three and 13 populations. All association methods were evaluated for their ability to detect genotype-phenotype associations based on power, false discovery rate, and number of false positives. Genomic inflation was highest for the uncorrected Random Forest and LFMM tests and lowest for GEMMA and Zhao's Random Forest. All association tests had similar power to detect causal loci, and Zhao's Random Forest had the lowest false discovery rate in all scenarios. To measure the performance of association tests in small data sets with few loci surrounding a causal gene, we reran the analyses after removing the causal loci from each data set. In these analyses, the association tests found true positives, defined as loci located within 30 kbp of a causal locus, in only 3%-18% of simulations. In contrast, at least one false positive was found in 17%-44% of simulations. Zhao's Random Forest again identified the fewest false positives of all association tests studied. The ability to test the power of association tests for individual empirical data sets can be an extremely useful first step when designing a GWAS.
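The evaluation criteria named in the abstract (power, false discovery rate, number of false positives, and genomic inflation) can be illustrated with a minimal Python sketch. This is not the authors' code, and the paper's exact definitions may differ. The function names, the `(chromosome, position)` tuple layout, and the use of 1-df chi-square statistics for the inflation factor are assumptions made for illustration; only the 30 kbp window comes from the abstract.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom,
# the usual denominator of the genomic inflation factor (lambda).
CHI2_1DF_MEDIAN = 0.4549364


def near(hit, locus, window_bp):
    """True if hit and locus share a chromosome and lie within window_bp."""
    return hit[0] == locus[0] and abs(hit[1] - locus[1]) <= window_bp


def evaluate_hits(hits, causal, window_bp=30_000):
    """Summarise one simulation (hypothetical helper, not from the paper).

    hits   -- list of (chromosome, position) loci flagged by a test
    causal -- list of (chromosome, position) simulated causal loci
    A hit counts as a true positive if it lies within window_bp of any
    causal locus; every other hit is a false positive.
    """
    true_pos = [h for h in hits if any(near(h, c, window_bp) for c in causal)]
    false_pos = len(hits) - len(true_pos)
    # Power here is the fraction of causal loci tagged by at least one hit.
    detected = sum(1 for c in causal if any(near(h, c, window_bp) for h in hits))
    power = detected / len(causal) if causal else 0.0
    fdr = false_pos / len(hits) if hits else 0.0
    return {"power": power, "fdr": fdr, "false_positives": false_pos}


def genomic_inflation(chi2_stats):
    """Observed median test statistic divided by its expected null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    hits = [("chr1", 1_000), ("chr1", 500_000), ("chr2", 20_000)]
    causal = [("chr1", 10_000), ("chr2", 100_000)]
    print(evaluate_hits(hits, causal))
```

Under these assumptions, an inflation factor near 1 indicates test statistics that follow the null distribution, while values well above 1 point to uncorrected confounding such as population structure.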

Keywords: GWAS; SNPs; association test; mixed models; random forest.

MeSH terms

Animals
Bass* / genetics
Genetic Association Studies
Genome-Wide Association Study
Models, Genetic
Phenotype
Polymorphism, Single Nucleotide*