Classification of SARS-CoV-2 sequences as recombinants via a pre-trained CNN and identification of a mathematical signature relative to recombinant feature at Spike, via interpretability

PLoS One. 2024 Aug 26;19(8):e0309391. doi: 10.1371/journal.pone.0309391. eCollection 2024.

Abstract

The global impact of the SARS-CoV-2 pandemic has underscored the need for a deeper understanding of viral evolution to anticipate new viruses or variants. Genetic recombination is a fundamental mechanism in viral evolution, yet it remains poorly understood. In this study, we investigated the genomic regions associated with the recombination feature in SARS-CoV-2. To this end, we implemented a two-phase transfer learning approach using genomic spectrograms of complete SARS-CoV-2 sequences. In the first phase, we used a pre-trained VGG-16 model with genomic spectrograms of HIV-1, and in the second phase, we applied the resulting HIV-1 VGG-16 model to SARS-CoV-2 spectrograms. Key recombination hot zones were identified with the Grad-CAM interpretability tool, and the results were analyzed using mathematical and image processing techniques. Our findings unequivocally identify the SARS-CoV-2 Spike protein (S protein) as the pivotal region for the genetic recombination feature. For non-recombinant sequences, the relevant frequencies clustered around 1/6 and 1/12. In recombinant sequences, the sharply prominent main hot zone in the Spike protein indicated a frequency of 1/6. These findings suggest that, in an arithmetic series, every 6 nucleotides (two triplets) in S may encode crucial information, potentially concealing essential details about viral characteristics, in this case the recombinant feature of a SARS-CoV-2 genetic sequence. This insight further underscores the potential presence of multifaceted information within the genome, including mathematical signatures that define an organism's unique attributes.
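
As a rough illustration of what a frequency of 1/6 means in a genomic spectrogram, the following minimal Python sketch computes windowed spectral power for a nucleotide sequence. It is not the authors' pipeline: the abstract does not state how nucleotides were mapped to numbers or which window settings were used, so the binary base-indicator mapping and the names `indicator_power` and `spectrogram`, along with the window and step values, are assumptions made for illustration only.

```python
import cmath


def indicator_power(seq, freq):
    """Summed spectral power at `freq` (cycles per nucleotide).

    Assumption: each base A, C, G, T is turned into a 0/1 indicator
    signal, and the power of the four signals is added together.
    """
    total = 0.0
    for base in "ACGT":
        acc = 0j
        for n, ch in enumerate(seq):
            if ch == base:
                # Single-frequency discrete Fourier sum over positions of `base`.
                acc += cmath.exp(-2j * cmath.pi * freq * n)
        total += abs(acc) ** 2
    return total / len(seq)


def spectrogram(seq, window, step, freqs):
    """One list of powers per sliding window; entries follow `freqs`."""
    return [
        [indicator_power(seq[start:start + window], f) for f in freqs]
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence with an exact 6-nucleotide repeat (hypothetical data).
toy = "ACGTTT" * 20
freqs = [1 / 12, 1 / 6, 1 / 5]
spec = spectrogram(toy, window=60, step=30, freqs=freqs)
for column in spec:
    print([round(p, 3) for p in column])
```

On a toy sequence built from an exact 6-nucleotide repeat, the power concentrates at 1/6 and is negligible at 1/12 and 1/5. In a real genome, a window such as the Spike region would show a relative peak at 1/6 rather than this idealized pattern.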

MeSH terms

  • COVID-19* / epidemiology
  • COVID-19* / genetics
  • COVID-19* / virology
  • Genome, Viral
  • HIV-1 / classification
  • HIV-1 / genetics
  • Humans
  • Neural Networks, Computer
  • Recombination, Genetic*
  • SARS-CoV-2* / genetics
  • Spike Glycoprotein, Coronavirus* / genetics

Substances

  • Spike Glycoprotein, Coronavirus
  • spike protein, SARS-CoV-2

Grants and funding

This work was supported by the Research Training Grants Program - University of Deusto: Ref. FPI UD_2021_10. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.