Valid inference for machine learning-assisted genome-wide association studies

Jiacheng Miao; Yixuan Wu; Zhongxuan Sun; Xinran Miao; Tianyuan Lu; Jiwei Zhao; Qiongshi Lu

doi:10.1038/s41588-024-01934-0

Nat Genet. 2024 Nov;56(11):2361-2369. doi: 10.1038/s41588-024-01934-0. Epub 2024 Sep 30.

Authors

Jiacheng Miao¹, Yixuan Wu¹, Zhongxuan Sun¹, Xinran Miao², Tianyuan Lu³ ⁴, Jiwei Zhao¹ ², Qiongshi Lu⁵ ⁶ ⁷

Affiliations

¹ Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA.
² Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA.
³ Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada.
⁴ Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada.
⁵ Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA. qlu@biostat.wisc.edu.
⁶ Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA. qlu@biostat.wisc.edu.
⁷ Center for Demography of Health and Aging, University of Wisconsin-Madison, Madison, WI, USA. qlu@biostat.wisc.edu.

PMID: 39349818
DOI: 10.1038/s41588-024-01934-0

Abstract

Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association studies (GWAS), which use sophisticated ML techniques to impute phenotypes and then perform GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.
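The abstract does not state the estimator itself, so the following is only a rough sketch of how a summary-statistics-only correction of this kind could look, written in the style of prediction-powered inference. It assumes three per-variant GWAS results are available: the observed phenotype in the labeled sample, the imputed phenotype in the same labeled sample, and the imputed phenotype in an independent unlabeled sample. The function name `debiased_effect`, its parameters, and the correlation `r` between the two labeled-sample estimates (which would have to be estimated in practice) are all hypothetical and are not taken from the paper or its software.

```python
import math


def debiased_effect(beta_y_lab, se_y_lab,
                    beta_yhat_lab, se_yhat_lab,
                    beta_yhat_unlab, se_yhat_unlab,
                    r, omega=None):
    """Combine three per-variant summary statistics (illustrative sketch).

    beta_y_lab / se_y_lab         : GWAS of the observed phenotype, labeled sample
    beta_yhat_lab / se_yhat_lab   : GWAS of the imputed phenotype, labeled sample
    beta_yhat_unlab / se_yhat_unlab: GWAS of the imputed phenotype, unlabeled sample
    r     : ASSUMED correlation between the two labeled-sample estimates
    omega : weight on the imputed-phenotype contrast; if None, the
            variance-minimizing value is used
    """
    # Covariance of the two estimates that share the labeled individuals.
    cov = r * se_y_lab * se_yhat_lab
    # Variance of the contrast, assuming the unlabeled sample is independent.
    var_contrast = se_yhat_lab ** 2 + se_yhat_unlab ** 2
    if omega is None:
        omega = cov / var_contrast

    # The contrast has expectation zero whatever the imputation quality,
    # so subtracting it leaves the labeled-sample estimate unbiased.
    beta = beta_y_lab - omega * (beta_yhat_lab - beta_yhat_unlab)
    var = se_y_lab ** 2 + omega ** 2 * var_contrast - 2.0 * omega * cov
    se = math.sqrt(var)

    # Two-sided p-value from a normal approximation.
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p


# Example with made-up numbers, not results from the paper.
beta, se, p = debiased_effect(0.05, 0.02, 0.04, 0.02, 0.03, 0.005, r=0.8)
print(beta, se, p)
```

Under these assumptions, an uninformative imputation (`r = 0`) gives `omega = 0` and the result falls back to the labeled-sample GWAS, while a more accurate imputation shrinks the standard error. This matches the abstract's claim of validity irrespective of imputation quality, but only within this simplified sketch.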

MeSH terms

Algorithms
Bone Density / genetics
Genome-Wide Association Study* / methods
Humans
Machine Learning*
Phenotype
Polymorphism, Single Nucleotide*
