A predictive language model for SARS-CoV-2 evolution

Enhao Ma; Xuan Guo; Mingda Hu; Penghua Wang; Xin Wang; Congwen Wei; Gong Cheng

doi:10.1038/s41392-024-02066-x

Signal Transduct Target Ther. 2024 Dec 23;9(1):353. doi: 10.1038/s41392-024-02066-x.

Authors

Enhao Ma^#¹, Xuan Guo^#^{2,3}, Mingda Hu⁴, Penghua Wang⁵, Xin Wang⁴, Congwen Wei⁶, Gong Cheng^{7,8}

Affiliations

¹ School of Basic Medical Science, Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China.
² School of Basic Medical Science, Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China. 15210418734@163.com.
³ Institute of Infectious Diseases, Shenzhen Bay Laboratory, Guangqiao Rd., Guangming District, Shenzhen, Guangdong, 518000, China. 15210418734@163.com.
⁴ Beijing Institute of Biotechnology, 20 Dongdajie, Fengtai District, Beijing, 100071, China.
⁵ Department of Immunology, School of Medicine, University of Connecticut Health Center, Farmington, CT, 06030, USA.
⁶ Beijing Institute of Biotechnology, 20 Dongdajie, Fengtai District, Beijing, 100071, China. weicongwen@aliyun.com.
⁷ School of Basic Medical Science, Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China. gongcheng@mail.tsinghua.edu.cn.
⁸ Institute of Infectious Diseases, Shenzhen Bay Laboratory, Guangqiao Rd., Guangming District, Shenzhen, Guangdong, 518000, China. gongcheng@mail.tsinghua.edu.cn.

^# Contributed equally.

PMID: 39710752
DOI: 10.1038/s41392-024-02066-x

Abstract

Modeling and predicting mutations are critical for preparedness against COVID-19 and similar pandemics. However, existing predictive models have yet to integrate the regularity and randomness of viral mutations with minimal data requirements. Here, we develop a non-demanding language model that uses both regularity and randomness to predict candidate SARS-CoV-2 variants and mutations that might prevail. We constructed the "grammatical frameworks" of the available S1 sequences for dimension reduction and semantic representation to grasp the model's latent regularity. The mutational profile, defined as the frequency of mutations, was introduced into the model to incorporate randomness. With this model, we identified several variants and validated, by wet-lab experiments, that they have significantly enhanced viral infectivity and immune evasion. By inputting sequence data from three different time points, we detected circulating strains or vital mutations for the XBB.1.16, EG.5, JN.1, and BA.2.86 strains before their emergence. Our results also predicted previously unknown variants that may cause future epidemics. Supported by both data validation and experimental evidence, our study presents a fast-responding, concise, and promising language model, potentially generalizable to other viral pathogens, to forecast viral evolution and detect crucial mutation hot spots, thereby warning of emerging variants that might raise public health concern.
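
The abstract defines the mutational profile only as "the frequency of mutations". As a minimal sketch of one possible reading, the Python below counts, for each substitution relative to a reference, the fraction of aligned sequences that carry it. This is an illustration under stated assumptions, not the authors' implementation: the function name `mutational_profile`, the toy 10-residue sequences, and the choice to key mutations by (position, reference residue, observed residue) are all hypothetical.

```python
from collections import Counter


def mutational_profile(reference, sequences):
    """Map (position, ref_residue, alt_residue) to the fraction of sequences carrying it.

    Assumes every sequence is already aligned to the reference and has the
    same length; positions are 1-based. Hypothetical helper for illustration.
    """
    if not sequences:
        return {}
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference length")
        for pos, (ref, alt) in enumerate(zip(reference, seq), start=1):
            if alt != ref:
                counts[(pos, ref, alt)] += 1
    total = len(sequences)
    return {mutation: n / total for mutation, n in counts.items()}


# Toy data (made up, not real S1 sequences).
reference = "MFVFLVLLPL"
sequences = ["MFVFLVLLPL", "MFVFFVLLPL", "MFVFFVLLPL", "MFVFLVLLSL"]
profile = mutational_profile(reference, sequences)
```

In this toy set, two of the four sequences carry the substitution at position 5 and one carries the substitution at position 9, so the profile assigns them frequencies of 0.5 and 0.25. How the published model weights or samples from such frequencies is not stated in the abstract.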

MeSH terms

COVID-19* / genetics
COVID-19* / virology
Evolution, Molecular
Humans
Mutation*
SARS-CoV-2* / genetics
SARS-CoV-2* / pathogenicity
Spike Glycoprotein, Coronavirus / genetics

Substances

Spike Glycoprotein, Coronavirus

Supplementary concepts

SARS-CoV-2 variants

Grants and funding

32188101, 81961160737, and 31825001/National Natural Science Foundation of China (National Science Foundation of China)