Motivation: Microarray experiments generate a considerable amount of data, which analyzed properly help us gain a huge amount of biologically relevant information about the global cellular behaviour. Clustering (grouping genes with similar expression profiles) is one of the first steps in data analysis of high-throughput expression measurements. A number of clustering algorithms have proved useful to make sense of such data. These classical algorithms, though useful, suffer from several drawbacks (e.g. they require the predefinition of arbitrary parameters like the number of clusters; they force every gene into a cluster despite a low correlation with other cluster members). In the following we describe a novel adaptive quality-based clustering algorithm that tackles some of these drawbacks.
Results: We propose a heuristic iterative two-step algorithm: First, we find in the high-dimensional representation of the data a sphere where the "density" of expression profiles is locally maximal (based on a preliminary estimate of the radius of the cluster-quality-based approach). In a second step, we derive an optimal radius of the cluster (adaptive approach) so that only the significantly coexpressed genes are included in the cluster. This estimation is achieved by fitting a model to the data using an EM-algorithm. By inferring the radius from the data itself, the biologist is freed from finding an optimal value for this radius by trial-and-error. The computational complexity of this method is approximately linear in the number of gene expression profiles in the data set. Finally, our method is successfully validated using existing data sets.
Availability: http://www.esat.kuleuven.ac.be/~thijs/Work/Clustering.html