Using the COG database to improve gene recognition in complete genomes

Genetica. 2000;108(1):9-17. doi: 10.1023/a:1004031323748.

Abstract

A complete understanding of the biology of an organism necessarily starts with knowledge of its genetic makeup. Proteins encoded in a genome must be identified and characterized, and the presence or absence of specific sets of proteins must be noted in order to determine the possible biochemical pathways or functional systems used by that organism. The COG database presents a set of tools suited to these purposes, including the ability to select protein families (COGs) that contain proteins from a specified set of species. The selection is based upon a phylogenetic pattern, which is a shorthand representation of which species are present in, or absent from, a given COG. Here we present the use of phylogenetic patterns as a means to perform targeted searches for undetected protein-coding genes in complete genomes.
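
The idea of selecting COGs by phylogenetic pattern can be illustrated with a minimal Python sketch. It is not the COG database's own tooling: the one-letter species codes, the `pattern` and `missing_only` helpers, and the COG identifiers and memberships are all hypothetical, chosen only to show how a pattern with a single gap can flag a genome in which a gene may have been missed.

```python
# Hypothetical one-letter species codes in a fixed order (illustrative only).
SPECIES = ["a", "m", "y", "e", "b"]


def pattern(present):
    """Build a phylogenetic pattern: the species letter if present, '-' if absent."""
    return "".join(s if s in present else "-" for s in SPECIES)


def missing_only(cogs, target):
    """Return the IDs of COGs that contain every species except `target`."""
    wanted = pattern(set(SPECIES) - {target})
    return sorted(cid for cid, members in cogs.items() if pattern(members) == wanted)


# Made-up COG membership, not real database content.
toy_cogs = {
    "COG_A": {"a", "m", "y", "e", "b"},  # present in all species
    "COG_B": {"a", "m", "y", "b"},       # absent only from "e"
    "COG_C": {"e", "b"},                 # present in two species
    "COG_D": {"a", "m", "y", "b"},       # absent only from "e"
}

# COGs whose pattern lacks only species "e" are candidates for a targeted search.
candidates = missing_only(toy_cogs, "e")
print(pattern(toy_cogs["COG_B"]), candidates)
```

A COG that is missing from only one genome may reflect a true gene loss or a gene that was overlooked during annotation. In a workflow like the one the abstract describes, such COGs would presumably be the starting point for a targeted search of that genome's sequence.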

MeSH terms

  • Algorithms
  • Bacterial Proteins / genetics
  • Computational Biology / methods*
  • Databases, Factual*
  • Fungal Proteins / genetics
  • Genome, Archaeal*
  • Genome, Bacterial*
  • Molecular Sequence Data
  • Multigene Family / genetics*
  • Phylogeny
  • Saccharomyces cerevisiae / genetics
  • Sequence Homology, Amino Acid
  • Species Specificity

Substances

  • Bacterial Proteins
  • Fungal Proteins