Motivation: MicroRNAs (miRNAs) are a set of short (21-24 nt) non-coding RNAs that play significant roles as post-transcriptional regulators in animals and plants. While some existing methods use comparative genomic approaches to identify plant precursor miRNAs (pre-miRNAs), others are based on the complementarity characteristics between miRNAs and their target mRNAs sequences. However, they can only identify the homologous miRNAs or the limited complementary miRNAs. Furthermore, since the plant pre-miRNAs are quite different from the animal pre-miRNAs, all the ab initio methods for animals cannot be applied to plants. Therefore, it is essential to develop a method based on machine learning to classify real plant pre-miRNAs and pseudo genome hairpins.
Results: A novel classification method based on support vector machine (SVM) is proposed specifically for predicting plant pre-miRNAs. To make efficient prediction, we extract the pseudo hairpin sequences from the protein coding sequences of Arabidopsis thaliana and Glycine max, respectively. These pseudo pre-miRNAs are extracted in this study for the first time. A set of informative features are selected to improve the classification accuracy. The training samples are selected according to their distributions in the high-dimensional sample space. Our classifier PlantMiRNAPred achieves >90% accuracy on the plant datasets from eight plant species, including A.thaliana, Oryza sativa, Populus trichocarpa, Physcomitrella patens, Medicago truncatula, Sorghum bicolor, Zea mays and G.max. The superior performance of the proposed classifier can be attributed to the extracted plant pseudo pre-miRNAs, the selected training dataset and the carefully selected features. The ability of PlantMiRNAPred to discern real and pseudo pre-miRNAs provides a viable method for discovering new non-homologous plant pre-miRNAs.