The Terabase Search Engine: a large-scale relational database of short-read sequences

Richard Wilton; Sarah J Wheelan; Alexander S Szalay; Steven L Salzberg

doi:10.1093/bioinformatics/bty657

Bioinformatics. 2019 Feb 15;35(4):665-670. doi: 10.1093/bioinformatics/bty657.

Authors

Richard Wilton¹; Sarah J Wheelan²,³; Alexander S Szalay¹,⁴; Steven L Salzberg³,⁴,⁵,⁶

Affiliations

¹ Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD, USA.
² Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
³ Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA.
⁴ Department of Computer Science, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
⁵ Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
⁶ Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA.

Abstract

Motivation: DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across multiple whole-genome sequencing samples.

Results: We have built a compact, efficiently indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly.
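As an illustration only, the kind of position-range lookup described above could be sketched as a relational query over an indexed table of mapped reads. The table name `reads`, its columns, the index, the sample data, and the helper `reads_overlapping` are all hypothetical and are not taken from the Terabase Search Engine's actual schema or interface; the sketch uses an in-memory SQLite database purely to stay self-contained.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory table of mapped reads (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE reads (
               sample_id TEXT NOT NULL,
               chrom     TEXT NOT NULL,
               pos       INTEGER NOT NULL,  -- 1-based leftmost mapped position
               seq       TEXT NOT NULL
           )"""
    )
    # An index on (chrom, pos) lets the engine narrow a locus query quickly.
    conn.execute("CREATE INDEX idx_reads_locus ON reads (chrom, pos)")
    rows = [
        ("S1", "chr1", 100, "ACGTACGTAC"),
        ("S1", "chr1", 108, "ACGGTTCA"),
        ("S2", "chr1", 120, "TTGACCAGTA"),
        ("S2", "chr2", 105, "GGGCCCAATT"),
    ]
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?)", rows)
    return conn


def reads_overlapping(conn, chrom, start, end):
    """Return reads from all samples that overlap the closed range [start, end]."""
    # A read overlaps the range if it starts at or before `end`
    # and its last base lies at or after `start`.
    cur = conn.execute(
        """SELECT sample_id, pos, seq
           FROM reads
           WHERE chrom = ?
             AND pos <= ?
             AND pos + LENGTH(seq) - 1 >= ?
           ORDER BY sample_id, pos""",
        (chrom, end, start),
    )
    return cur.fetchall()


conn = build_demo_db()
hits = reads_overlapping(conn, "chr1", 105, 115)
for sample_id, pos, seq in hits:
    print(sample_id, pos, seq)
```

Because every sample's reads share one table, the same predicate applies across samples without per-sample handling, which is the sense in which relational logic helps here. A production system at the scale described would need far more compact storage and indexing than this toy example.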

Availability and implementation: Public access to the Terabase Search Engine database is available at http://tse.idies.jhu.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Databases, Genetic*
Genome, Human
Genomics
Humans
Search Engine*
Sequence Analysis, DNA
Software*
