MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database

Alessandro Pandini; Laura Bonati; Franca Fraternali; Jens Kleinjung

doi:10.1093/bioinformatics/btl637

Bioinformatics. 2007 Feb 15;23(4):515-6. doi: 10.1093/bioinformatics/btl637. Epub 2007 Jan 3.

Authors

Alessandro Pandini¹, Laura Bonati, Franca Fraternali, Jens Kleinjung

Affiliation

¹ Dipartimento di Scienze dell'Ambiente e del Territorio, Università degli Studi di Milano-Bicocca, Milano, Italy.

PMID: 17204463
DOI: 10.1093/bioinformatics/btl637

Abstract

Motivation: The size of current protein databases is a challenge for many bioinformatics applications, in terms of both processing speed and information redundancy. It may therefore be desirable to reduce the database of interest efficiently to a maximally representative subset.

Results: The MinSet method combines a suffix tree and a genetic algorithm for the generation, selection and assessment of database subsets. The approach is applicable to any type of string-encoded data and allows a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings: the structures of the SCOP40 domain database were translated into character strings by means of a structural alphabet, and optimized subsets were extracted according to an entropy score based on a constant-length fragment dictionary. The optimized subsets are therefore maximally representative of the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of >90% for fragments of length 1-4.
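To make the notions of a constant-length fragment dictionary, an entropy score and fragment coverage concrete, the following is a minimal Python sketch. It is not the MinSet implementation: the abstract does not give the exact form of the entropy score, so plain Shannon entropy over fragment frequencies is assumed here, and simple counting stands in for the suffix tree. The function names (fragment_dictionary, shannon_entropy, coverage) and the toy strings are hypothetical.

```python
from collections import Counter
from math import log2


def fragment_dictionary(strings, k):
    """Count every overlapping fragment of length k in a set of strings.

    Assumption: simple counting replaces the suffix tree named in the abstract.
    """
    counts = Counter()
    for s in strings:
        # Strings shorter than k yield an empty range and add nothing.
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a fragment frequency distribution.

    Assumption: the abstract only says 'an entropy score'; this form is a guess.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in counts.values())


def coverage(subset, full_set, k):
    """Fraction of distinct length-k fragments of full_set that occur in subset."""
    full = fragment_dictionary(full_set, k)
    if not full:
        return 0.0
    sub = fragment_dictionary(subset, k)
    return sum(1 for fragment in full if fragment in sub) / len(full)


# Toy strings over a two-letter alphabet (hypothetical data, not SCOP40).
database = ["ABAB", "ABBA", "BBBB"]
print(coverage(["ABBA"], database, 2))
print(shannon_entropy(fragment_dictionary(database, 2)))
```

In this sketch a candidate subset would be scored by comparing its fragment distribution and coverage with those of the full set; a search procedure such as the genetic algorithm mentioned in the abstract would then propose and select subsets, which is not shown here.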

Availability: http://mathbio.nimr.mrc.ac.uk/~jkleinj/MinSet.

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

Algorithms*
Data Compression / methods*
Database Management Systems*
Databases, Protein*
Dictionaries, Chemical as Topic
Peptide Fragments / chemistry
Peptide Fragments / classification
Proteins / chemistry*
Proteins / classification*
Sequence Analysis, Protein / methods*
Software

Substances

Peptide Fragments
Proteins

Grants and funding

MC_U117581331/MRC_/Medical Research Council/United Kingdom