Origin, phylogeny, variability and epitope conservation of SARS-CoV-2 worldwide

Virus Res. 2021 Oct 15:304:198526. doi: 10.1016/j.virusres.2021.198526. Epub 2021 Jul 30.

Abstract

The coronavirus disease 2019 (COVID-19) pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) poses numerous challenges, such as understanding what triggered the emergence of this new human virus, how this RNA virus is evolving, and how the variability of the viral genome may affect the primary structure of proteins that are vaccine targets. We analyzed 19,471 SARS-CoV-2 genomes from all over the world available in the GISAID database and 3,335 genomes of other Coronaviridae family members available in GenBank, collecting high-quality SARS-CoV-2 genomes and distinct Coronaviridae family genomes. Additionally, we analyzed 199,984 spike glycoprotein sequences. Here, we identify a SARS-CoV-2 emerging cluster containing 13 closely related genomes isolated from bat and pangolin that showed evidence of recombination, which may have contributed to the emergence of SARS-CoV-2. The analyzed SARS-CoV-2 genomes presented 9,632 single nucleotide variants (SNVs), corresponding to a variant density of 0.3 over the genome, and a clear geographic distribution. SNVs are unevenly distributed throughout the genome, and mutation hotspots were found in the spike gene and ORF1ab. We describe a set of predicted spike protein epitopes whose variability is negligible. Additionally, all predicted epitopes for the structural E, M and N proteins are highly conserved. The amino acid changes present in the spike glycoprotein of variants of concern (VOCs) affect between 3.4% and 20.7% of the predicted epitopes of this protein. These results support the continued efficacy of the available vaccines targeting the spike protein and other structural proteins. Multi-epitope vaccines should sustain vaccine efficacy, since at least some of the epitopes located in the variable regions of VOCs are conserved and thus remain recognizable by antibodies.
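The reported variant density can be checked with simple arithmetic. The sketch below assumes that "variant density" means the number of distinct SNVs divided by the genome length, and that the genome length is 29,903 nucleotides (the Wuhan-Hu-1 reference, NC_045512.2). Neither the definition nor the reference length is stated in the abstract, so both are assumptions, and the function name is illustrative only.

```python
# Assumption: genome length of the Wuhan-Hu-1 reference (NC_045512.2).
# The abstract does not state which reference length was used.
REFERENCE_GENOME_LENGTH = 29_903

# SNV count reported in the abstract.
SNV_COUNT = 9_632


def variant_density(snv_count: int, genome_length: int) -> float:
    """Return SNVs per nucleotide position (an assumed definition)."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if snv_count < 0:
        raise ValueError("snv_count must be non-negative")
    return snv_count / genome_length


density = variant_density(SNV_COUNT, REFERENCE_GENOME_LENGTH)
print(f"variant density: {density:.2f} SNVs per position")
```

Under these assumptions the ratio rounds to the 0.3 quoted in the abstract, which suggests the figure is a per-position density rather than a per-genome average.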

Keywords: COVID-19; Coronavirus comparative genomics; Epitope prediction; SARS-CoV-2 genomics; Spike protein.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Animals
  • COVID-19 / epidemiology
  • COVID-19 / virology*
  • Databases, Genetic
  • Genome, Viral
  • Humans
  • Mutation
  • Pandemics*
  • Phylogeography
  • SARS-CoV-2* / classification
  • SARS-CoV-2* / genetics