Abstract
Previous work has shown that proteins with the potential to be vaccine candidates can be predicted from features derived from their amino acid (AA) sequences. In this work, we empirically compare various machine learning classifiers on this sequence-based inference problem. Using systematic cross-validation on a dataset of 200 known vaccine candidates and 200 negative examples, with 525 features derived from the AA sequences and feature selection by greedy backward elimination, we show that simple classification algorithms often perform as well as more complex support vector kernel machines. The work also includes a novel cross-validation scheme applied across bacterial species: the validation proteins all come from a single bacterial species that is not represented in the training set. We term this type of validation Leave One Bacteria Out Validation (LOBOV).
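The species-level validation described above can be illustrated with a minimal sketch. It assumes only that each protein carries a species label; the function name `leave_one_species_out` and the sample labels are hypothetical and are not taken from the study, which may have formed its folds differently.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_species_out(
    species: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_species, train_indices, validation_indices).

    Each fold validates on every protein from one species and trains on
    the proteins from all remaining species.
    """
    # Group protein indices by their species label.
    by_species: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(species):
        by_species[label].append(index)

    # One fold per species: that species is held out for validation.
    for held_out, validation in by_species.items():
        train = [i for i in range(len(species)) if species[i] != held_out]
        yield held_out, train, validation


# Hypothetical species labels, one per protein (not data from the study).
species_labels = ["E. coli", "E. coli", "S. aureus", "N. meningitidis", "S. aureus"]

folds = list(leave_one_species_out(species_labels))
for held_out, train, validation in folds:
    print(f"held out {held_out}: train={train} validation={validation}")
```

Because no species contributes proteins to both sides of a fold, a classifier's score under this scheme reflects how well it transfers to a bacterium it has never seen, which ordinary random folds do not measure.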
Publication types
- Evaluation Study
- Research Support, Non-U.S. Gov't
MeSH terms
- Algorithms*
- Antigens, Bacterial / immunology*
- Bacterial Proteins / immunology*
- Bacterial Vaccines / immunology*
- Computational Biology / methods*
- Humans
- Machine Learning
- Vaccinology*
Substances
- Antigens, Bacterial
- Bacterial Proteins
- Bacterial Vaccines
Grants and funding
One of the authors (CHW) is employed by Merck Research Laboratories. Merck Research Laboratories provided support in the form of salary for CHW but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. AH’s contribution to this work was funded by a Marie Curie Career Integration Grant (CIG, PCIG13-GA2013-618334). MN’s contribution to this work was funded by an EPSRC grant: Joining the Dots, from Data to Insight GR EP/NO14189/1. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.