Strategies to improve usability and preserve accuracy in biological sequence databases

Johan Bengtsson-Palme; Fredrik Boulund; Robert Edström; Amir Feizi; Anna Johnning; Viktor A Jonsson; Fredrik H Karlsson; Chandan Pal; Mariana Buongermino Pereira; Anna Rehammar; José Sanchez; Kemal Sanli; Kaisa Thorell

doi:10.1002/pmic.201600034

Proteomics. 2016 Sep;16(18):2454-60. doi: 10.1002/pmic.201600034.

Authors

Johan Bengtsson-Palme¹ ², Fredrik Boulund³ ⁴ ⁵, Robert Edström⁴ ⁶, Amir Feizi⁷, Anna Johnning³ ⁴, Viktor A Jonsson⁴, Fredrik H Karlsson⁷, Chandan Pal⁸ ³, Mariana Buongermino Pereira³ ⁴, Anna Rehammar⁴, José Sanchez⁴ ⁹, Kemal Sanli¹⁰, Kaisa Thorell⁵

Affiliations

¹ Department of Infectious Diseases, Institute of Biomedicine, the Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden. johan.bengtsson-palme@microbiology.se.
² Centre for Antibiotic Resistance Research (CARe), University of Gothenburg, Gothenburg, Sweden. johan.bengtsson-palme@microbiology.se.
³ Centre for Antibiotic Resistance Research (CARe), University of Gothenburg, Gothenburg, Sweden.
⁴ Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden.
⁵ Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, Sweden.
⁶ Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden.
⁷ Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden.
⁸ Department of Infectious Diseases, Institute of Biomedicine, the Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden.
⁹ Bioinformatics Core Facility, University of Gothenburg, Gothenburg, Sweden.
¹⁰ Department of Biology and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden.

PMID: 27528420
DOI: 10.1002/pmic.201600034

Abstract

Biology is increasingly dependent on large-scale analysis, such as proteomics, creating a requirement for efficient bioinformatics. Bioinformatic predictions of biological functions rely upon correctly annotated database sequences, and the presence of inaccurately annotated or otherwise poorly described sequences introduces noise and bias to biological analyses. Accurate annotations are, for example, pivotal for correct identification of polypeptide fragments. However, standards for how sequence databases are organized and presented are currently insufficient. Here, we propose five strategies to address fundamental issues in the annotation of sequence databases: (i) to clearly separate experimentally verified and unverified sequence entries; (ii) to enable a system for tracing the origins of annotations; (iii) to separate entries with high-quality, informative annotation from less useful ones; (iv) to integrate automated quality-control software whenever such tools exist; and (v) to facilitate postsubmission editing of annotations and metadata associated with sequences. We believe that implementation of these strategies, for example as requirements for publication of database papers, would enable biology to better take advantage of large-scale data.
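As an illustration only, strategies (i), (ii) and (v) could be sketched as a small data model. This is not the authors' implementation or the schema of any existing database: the names `Evidence`, `Annotation`, `SequenceEntry`, `split_by_evidence` and all fields are hypothetical, chosen to show how a verification flag, a provenance pointer and an edit history might sit alongside a sequence record.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Evidence(Enum):
    """Hypothetical two-level evidence flag for strategy (i)."""
    EXPERIMENTAL = "experimentally verified"
    PREDICTED = "unverified prediction"


@dataclass
class Annotation:
    label: str
    evidence: Evidence
    # Strategy (ii): assumed provenance pointer, e.g. the accession of the
    # entry this label was transferred from (None if it is original).
    source: Optional[str] = None


@dataclass
class SequenceEntry:
    accession: str
    sequence: str
    annotations: List[Annotation] = field(default_factory=list)
    # Strategy (v): a plain-text log of post-submission edits.
    history: List[str] = field(default_factory=list)

    def is_verified(self) -> bool:
        """True if at least one annotation has experimental support."""
        return any(a.evidence is Evidence.EXPERIMENTAL for a in self.annotations)

    def revise(self, index: int, new_label: str, editor: str) -> None:
        """Change an annotation label and record who changed what."""
        old = self.annotations[index].label
        self.annotations[index].label = new_label
        self.history.append(f"{editor}: {old!r} -> {new_label!r}")


def split_by_evidence(
    entries: List[SequenceEntry],
) -> Tuple[List[SequenceEntry], List[SequenceEntry]]:
    """Separate verified entries from unverified ones (strategy (i))."""
    verified = [e for e in entries if e.is_verified()]
    unverified = [e for e in entries if not e.is_verified()]
    return verified, unverified
```

In this sketch a search tool could query only the verified partition by default, while the `source` field lets a user follow a transferred annotation back to the entry it came from.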

Keywords: Annotation; Bioinformatics; Databases; Functional prediction; Sequencing; Standards.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology / methods*
Databases, Protein*
Quality Control
Sequence Analysis
Software*