Motivation: A positional weight matrix (PWM) is derived from a set of experimentally determined binding sites. Here we explore whether there exist subclasses of binding sites and whether a mixture of these subclass-PWMs can improve binding site prediction. Intuitively, the subclasses correspond either to distinct binding preferences of the same transcription factor in different contexts or to distinct subtypes of the transcription factor.
Results: We report an Expectation Maximization algorithm adapting the mixture model of Bailey and Elkan. We assessed the relative merit of using two subclass-PWMs. The resulting PWMs were evaluated with respect to preferred conservation (relative to mouse) of potential sites in human promoters and expression coherence of the potential target genes. Based on 64 JASPAR vertebrate PWMs, 61-81% of the cases resulted in higher conservation using the mixture model. Also, in 98% of the cases the expression coherence was higher for the target genes of one of the subclass-PWMs. Our analysis of Reb1 sites is consistent with subtypes previously discovered using independent methods. Additionally, application of our method to mutated sites for the transcription factor LEU3 reveals subclasses that segregate into strongly binding and weakly binding sites with a P-value of 0.008. This is the first study that attempts to quantify the subtly different binding specificities of a transcription factor on a large scale, and it suggests the use of a mixture of PWMs, instead of the current practice of using a single PWM, for a transcription factor.
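To make the idea of subclass-PWMs concrete, the following is a minimal sketch of Expectation Maximization for a mixture of two PWMs over aligned binding sites of equal length. It is a generic mixture-of-multinomials formulation, not the exact algorithm reported here: the function name `fit_pwm_mixture`, the pseudocount value, the random soft initialization, and the toy sites are all illustrative assumptions.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: sites are uppercase DNA without ambiguity codes


def fit_pwm_mixture(sites, k=2, n_iter=50, pseudocount=0.5, seed=0):
    """Fit a mixture of k PWMs to aligned, equal-length sites with EM.

    Hypothetical helper for illustration only. Returns (weights, pwms, resp):
    mixture weights, one PWM per subclass (a list of per-position dicts of
    base probabilities), and per-site subclass responsibilities.
    """
    rng = random.Random(seed)
    width = len(sites[0])

    # Random soft initial assignment of each site to the k subclasses.
    resp = []
    for _ in sites:
        row = [rng.random() + 0.1 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: re-estimate mixture weights and PWMs from soft counts.
        weights = [sum(r[c] for r in resp) / len(sites) for c in range(k)]
        pwms = []
        for c in range(k):
            pwm = []
            for pos in range(width):
                counts = {b: pseudocount for b in ALPHABET}
                for site, r in zip(sites, resp):
                    counts[site[pos]] += r[c]
                total = sum(counts.values())
                pwm.append({b: counts[b] / total for b in ALPHABET})
            pwms.append(pwm)

        # E-step: posterior probability of each subclass given each site,
        # computed in log space and normalized for numerical stability.
        resp = []
        for site in sites:
            logp = []
            for c in range(k):
                lp = math.log(max(weights[c], 1e-300))
                lp += sum(math.log(pwms[c][pos][b]) for pos, b in enumerate(site))
                logp.append(lp)
            top = max(logp)
            probs = [math.exp(x - top) for x in logp]
            total = sum(probs)
            resp.append([p / total for p in probs])

    return weights, pwms, resp


if __name__ == "__main__":
    # Toy sites, invented for illustration; not data from this study.
    toy_sites = ["ACGTAC", "ACGTAA", "ACGTAC", "TTGCAA", "TTGCAT", "TTGCAA"]
    weights, pwms, resp = fit_pwm_mixture(toy_sites)
    for site, r in zip(toy_sites, resp):
        print(site, [round(x, 3) for x in r])
```

Each site receives a responsibility for every subclass, so a hard subclass assignment can be obtained by taking the subclass with the largest responsibility. Because EM can converge to different local optima, a practical application would likely compare several random initializations.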