Evidence-based gene predictions in plant genomes

Genome Res. 2009 Oct;19(10):1912-23. doi: 10.1101/gr.088997.108. Epub 2009 Jun 18.

Abstract

Automated evidence-based gene building is a rapid and cost-effective way to provide reliable gene annotations on newly sequenced genomes. One limitation of evidence-based gene builders, however, is their requirement for transcriptional evidence (known proteins, full-length cDNAs, or expressed sequence tags [ESTs]) in the species of interest. This limitation is of particular concern for plant genomes, where the rate of genome sequencing is greatly outpacing the rate of EST- and cDNA-sequencing projects. To overcome this limitation, we have developed an evidence-based gene build system (the Gramene pipeline) that can use transcriptional evidence across related species. The Gramene pipeline uses the Ensembl computing infrastructure with a novel data processing scheme. Using two previously annotated plant genomes, the dicot Arabidopsis thaliana and the monocot Oryza sativa, we show that cross-species ESTs from within the same class (monocot or dicot) are a valuable source of evidence for gene predictions. We also find that, using only EST and cross-species evidence, the Gramene pipeline can generate a plant gene set that is comparable in quality to the human gene set based on known proteins and full-length cDNAs. We compare the Gramene pipeline to several widely used ab initio gene prediction programs in rice; this comparison shows that the pipeline performs favorably at both the gene and exon levels even when it uses only cross-species gene products. We present the results of testing the pipeline on a 22-Mb region of the newly sequenced maize genome and discuss potential applications of the pipeline to other genomes.
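
The abstract reports comparisons "at both the gene and exon levels" without giving the scoring details. As an illustration only, the following Python sketch shows how exon-level accuracy is commonly scored in gene prediction benchmarks: sensitivity is the fraction of reference exons predicted with exactly matching boundaries, and specificity (in the gene prediction sense, equivalent to precision) is the fraction of predicted exons that exactly match a reference exon. The function name, the exon tuple layout, and the example coordinates are hypothetical and are not taken from the Gramene pipeline or the paper.

```python
from typing import Iterable, Tuple

# Hypothetical exon representation: (sequence id, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_level_accuracy(
    predicted: Iterable[Exon], reference: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the exon level.

    An exon counts as correct only if sequence, start, end, and strand
    all match a reference exon exactly. This exact-match convention is
    an assumption; the source abstract does not state its criteria.
    """
    pred_set = set(predicted)
    ref_set = set(reference)
    true_positives = len(pred_set & ref_set)

    # Sensitivity: share of reference exons that were recovered.
    sensitivity = true_positives / len(ref_set) if ref_set else 0.0
    # Specificity (precision): share of predicted exons that are correct.
    specificity = true_positives / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity


# Made-up coordinates for illustration; not real annotation data.
reference_exons = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 450, "+"),
    ("chr1", 600, 720, "+"),
    ("chr1", 900, 1000, "-"),
]
predicted_exons = [
    ("chr1", 100, 200, "+"),   # exact match
    ("chr1", 300, 450, "+"),   # exact match
    ("chr1", 600, 700, "+"),   # wrong end coordinate, so not counted
    ("chr1", 900, 1000, "-"),  # exact match
    ("chr1", 1200, 1300, "+"), # no reference exon at this position
]

sn, sp = exon_level_accuracy(predicted_exons, reference_exons)
print(f"exon sensitivity={sn:.2f} specificity={sp:.2f}")
```

Gene-level scoring can follow the same pattern by comparing whole gene structures (the ordered set of exons per gene) instead of individual exons; the stricter unit generally yields lower scores for the same predictions.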

Publication types

  • Evaluation Study
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Algorithms
  • Arabidopsis / genetics
  • Brassica / genetics
  • Chromosome Mapping / methods*
  • Computational Biology / methods*
  • Electronic Data Processing / methods
  • Forecasting
  • Genes, Plant*
  • Genome, Plant*
  • Oryza / genetics
  • Plant Proteins / analysis
  • Plant Proteins / genetics
  • Quality Control
  • Sensitivity and Specificity
  • Sequence Alignment / methods
  • Species Specificity
  • Zea mays / genetics

Substances

  • Plant Proteins