Analysis of SARS-CoV-2 genome variation using a minimal number of selected informative sites that form a genetic barcode presents several drawbacks. We show that purely mathematical procedures for site selection should be supervised by the known phylogeny (i) to ensure that solid tree branches are represented instead of mutational hotspots with poor phylogeographic properties, and (ii) to avoid phylogenetic redundancy. We propose a procedure that prevents information redundancy in site selection by considering the cumulative informativeness of previously selected sites (as a proxy for phylogeny-based criteria). This procedure demonstrates that, for short barcodes (e.g., 11 sites), there are thousands of informative site combinations that improve on previous proposals. We also show that barcodes based on worldwide databases inevitably prioritize variants located at the basal nodes of the phylogeny, and most representative genomes at these ancestral nodes are no longer in circulation. Consequently, coronavirus phylodynamics cannot be properly captured by universal genomic barcodes, because most SARS-CoV-2 variation is generated in geographically restricted areas by the continuous introduction of domestic variants.
Keywords: Barcode; COVID-19; Informative subtype markers; Phylodynamics; Phylogeny; SARS-CoV-2.
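The idea of selecting sites by their cumulative informativeness, rather than by ranking each site in isolation, can be illustrated with a minimal sketch. The abstract does not specify the informativeness measure or the search strategy, so everything below is an assumption made for illustration: the joint Shannon entropy of the partial barcodes stands in for "cumulative informativeness", the search is a simple greedy one, and the names `entropy` and `greedy_barcode` and the toy alignment are hypothetical. The sketch does not consult a phylogeny, so it only addresses redundancy, not the hotspot problem raised in point (i).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of hashable labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def greedy_barcode(genomes, k):
    """Greedily pick up to k alignment columns (0-based indices).

    At each step the column chosen is the one that most increases the
    joint entropy of the barcodes built from the columns selected so far.
    Assumption: joint entropy is used here as a stand-in for the
    'cumulative informativeness' mentioned in the text.
    """
    n_sites = len(genomes[0])
    selected = []
    barcodes = [()] * len(genomes)  # partial barcode of each genome
    current = 0.0                   # joint entropy of the current barcodes
    for _ in range(k):
        best_site, best_h = None, current
        for site in range(n_sites):
            if site in selected:
                continue
            h = entropy([b + (g[site],) for b, g in zip(barcodes, genomes)])
            if h > best_h + 1e-12:  # require a strict gain in information
                best_site, best_h = site, h
        if best_site is None:       # every remaining site is redundant
            break
        selected.append(best_site)
        barcodes = [b + (g[best_site],) for b, g in zip(barcodes, genomes)]
        current = best_h
    return selected


# Toy alignment (hypothetical data): columns 0 and 1 split the genomes
# identically, so column 1 is redundant once column 0 has been chosen.
genomes = ["ACT", "ACT", "ACT", "ACG", "GTT", "GTT", "GTT", "GTG"]
print(greedy_barcode(genomes, 2))
```

In this toy case, ranking sites by their individual entropy would select columns 0 and 1, which carry the same information. The greedy procedure instead skips column 1 and takes column 2, the only remaining column that further subdivides the genomes.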