Realistic artificial DNA sequences as negative controls for computational genomics

Juan Caballero; Arian F A Smit; Leroy Hood; Gustavo Glusman

doi:10.1093/nar/gku356

Nucleic Acids Res. 2014 Jul;42(12):e99. doi: 10.1093/nar/gku356. Epub 2014 May 6.

Authors

Juan Caballero¹, Arian F A Smit¹, Leroy Hood¹, Gustavo Glusman²

Affiliations

¹ Institute for Systems Biology, 401 Terry Ave. N, Seattle, WA 98109, USA.
² Institute for Systems Biology, 401 Terry Ave. N, Seattle, WA 98109, USA. gustavo@systemsbiology.org

Abstract

A common practice in computational genomic analysis is to use a set of 'background' sequences as negative controls for evaluating the false-positive rates of prediction tools, such as gene identification programs and algorithms for detection of cis-regulatory elements. Such 'background' sequences are generally taken from regions of the genome presumed to be intergenic, or generated synthetically by 'shuffling' real sequences. This last method can lead to underestimation of false-positive rates. We developed a new method for generating artificial sequences that are modeled after real intergenic sequences in terms of composition, complexity and interspersed repeat content. These artificial sequences can serve as an inexhaustible source of high-quality negative controls. We used artificial sequences to evaluate the false-positive rates of a set of programs for detecting interspersed repeats, ab initio prediction of coding genes, transcribed regions and non-coding genes. We found that RepeatMasker is more accurate than PClouds, Augustus has the lowest false-positive rate of the coding gene prediction programs tested, and Infernal has a low false-positive rate for non-coding gene detection. A web service, source code and the models for human and many other species are freely available at http://repeatmasker.org/garlic/.
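The contrast between 'shuffling' and model-based generation can be illustrated with a minimal sketch. This is not the authors' method: the abstract describes modeling composition, complexity and interspersed repeat content, whereas the sketch below only uses a simple k-th order Markov chain, and it ignores repeats entirely. The function names (`shuffle_sequence`, `train_markov`, `generate_markov`) and the toy input are assumptions made for illustration. The point it shows is that a plain shuffle preserves single-base counts only, while a k-mer model also carries over some local sequence context.

```python
import random
from collections import Counter, defaultdict


def shuffle_sequence(seq, rng):
    """Mononucleotide shuffle: keeps base counts, discards all local context."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def train_markov(seq, k):
    """Count which base follows each k-mer in the training sequence."""
    model = defaultdict(Counter)
    for i in range(len(seq) - k):
        model[seq[i:i + k]][seq[i + k]] += 1
    return model


def generate_markov(model, k, length, rng):
    """Generate a sequence by sampling each base from the counts of its k-mer context."""
    contexts = sorted(model)
    out = list(rng.choice(contexts))  # seed with a random observed k-mer
    while len(out) < length:
        counts = model.get("".join(out[-k:]))
        if not counts:
            # Context never seen with a successor: fall back to a random one.
            counts = model[rng.choice(contexts)]
        bases = sorted(counts)
        weights = [counts[b] for b in bases]
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides, a simple measure of local composition."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for a real intergenic sequence (hypothetical data).
    real = "ACGT" * 50
    shuffled = shuffle_sequence(real, rng)
    modeled = generate_markov(train_markov(real, 2), 2, len(real), rng)
    print("real    :", dinucleotide_counts(real).most_common(4))
    print("shuffled:", dinucleotide_counts(shuffled).most_common(4))
    print("modeled :", dinucleotide_counts(modeled).most_common(4))
```

Comparing the dinucleotide counts of the three sequences makes the difference visible: the shuffled sequence spreads its counts across dinucleotides that never occur in the input, while the modeled sequence reproduces the input's dinucleotide usage. A realistic negative control would need considerably more than this, as the abstract indicates.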

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Animals
Cats
Cattle
DNA, Intergenic / chemistry*
Dogs
Genes
Genomics / methods*
Guinea Pigs
Humans
Introns
Mice
Models, Statistical
Rabbits
Rats
Repetitive Sequences, Nucleic Acid
Sequence Analysis, DNA / methods*

Substances

DNA, Intergenic
