EmbedGEM: a framework to evaluate the utility of embeddings for genetic discovery

Sumit Mukherjee; Zachary R McCaw; Jingwen Pei; Anna Merkoulovitch; Tom Soare; Raghav Tandon; David Amar; Hari Somineni; Christoph Klein; Santhosh Satapati; David Lloyd; Christopher Probert; Insitro Research Team; Daphne Koller; Colm O'Dushlaine; Theofanis Karaletsos

doi:10.1093/bioadv/vbae135

Bioinform Adv. 2024 Sep 17;4(1):vbae135. doi: 10.1093/bioadv/vbae135. eCollection 2024.

Affiliations

¹ Insitro Inc, South San Francisco, California 94080, United States.
² Center for Machine Learning, Georgia Institute of Technology, Georgia 30332, United States.
³ Chan-Zuckerberg Initiative, Redwood City, California 94063, United States.

Abstract

Summary: Machine learning-derived embeddings are a compressed representation of high-content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear whether genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. EmbedGEM compares embeddings along two axes: heritability and disease relevance. As measures of heritability, we consider the number of genome-wide significant associations and the mean $\chi^{2}$ statistic at significant loci. For disease relevance, we compute polygenic risk scores for each embedding principal component, then evaluate their association with high-confidence disease or trait labels in a held-out evaluation patient set. While our development of EmbedGEM is motivated by embeddings, the approach is generally applicable to multivariate traits and can readily be extended to accommodate additional metrics along the evaluation axes. We demonstrate EmbedGEM's utility by evaluating embeddings and multivariate traits in two separate datasets: (i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance and (ii) a real dataset from the UK Biobank, including metabolic and liver-related traits. Importantly, we show that greater disease relevance does not automatically follow from greater heritability.
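As an illustration of the two evaluation axes described above, the following is a minimal sketch and not the EmbedGEM implementation. It assumes the input $\chi^{2}$ statistics are 1-degree-of-freedom values for independent lead variants, uses the conventional genome-wide significance threshold of 5e-8, and summarizes disease relevance with a simple Pearson correlation between a polygenic risk score and a label; the function names `heritability_metrics` and `prs_label_association` are hypothetical, and the association test used in the paper may differ.

```python
import math

import numpy as np

# Conventional genome-wide significance threshold (an assumption of this sketch).
GWS_P_THRESHOLD = 5e-8


def heritability_metrics(chi2_stats, p_threshold=GWS_P_THRESHOLD):
    """Return (number of significant hits, mean chi-square at those hits).

    Assumes each statistic is a 1-df chi-square for an independent lead variant.
    """
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    # Upper-tail p-value of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    p_values = np.array([math.erfc(math.sqrt(x / 2.0)) for x in chi2_stats])
    significant = p_values < p_threshold
    n_hits = int(significant.sum())
    mean_chi2 = float(chi2_stats[significant].mean()) if n_hits else float("nan")
    return n_hits, mean_chi2


def prs_label_association(prs, labels):
    """Return (Pearson r, t-statistic) between a PRS and a held-out label.

    A simplified stand-in for a disease-relevance test on one embedding
    principal component; covariate adjustment is omitted.
    """
    prs = np.asarray(prs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = prs.shape[0]
    r = float(np.corrcoef(prs, labels)[0, 1])
    # Standard t-statistic for a correlation coefficient with n - 2 df.
    t_stat = r * math.sqrt((n - 2) / (1.0 - r * r))
    return r, t_stat


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not results from the paper.
    print(heritability_metrics([1.0, 5.0, 40.0, 60.0]))
    print(prs_label_association([0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1]))
```

Keeping the two functions separate mirrors the point made in the abstract: a trait can score well on the heritability metrics without its polygenic risk score being associated with the disease label.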

Availability and implementation: https://github.com/insitro/EmbedGEM.