SamCluster: an integrated scheme for automatic discovery of sample classes using gene expression profile

Bioinformatics. 2003 May 1;19(7):811-7. doi: 10.1093/bioinformatics/btg095.

Abstract

Motivation: Feature (gene) selection can dramatically improve the accuracy of sample class prediction based on gene expression profiles. Many statistical methods for feature (gene) selection, such as stepwise optimization and Monte Carlo simulation, have been developed for tissue sample classification. In contrast to class prediction, few statistical or computational methods for feature selection have been applied to clustering algorithms for pattern discovery.

Results: This report presents an integrated scheme, and a corresponding program, SamCluster, for automatic discovery of sample classes from gene expression profiles. The scheme incorporates feature selection based on the coefficient of variation (CV) and the t-test into hierarchical clustering, and proceeds as follows. First, genes whose CV exceeds a pre-specified threshold are selected for cluster analysis, which yields two putative sample classes. Next, genes that are significantly differentially expressed between the two putative sample classes (t-test p-value ≤ 0.01, 0.05, or 0.1) are selected for further cluster analysis. These steps are iterated until two stable sample classes are found. Finally, consensus sample classes are constructed from the putative classes derived under the different CV thresholds, and the putative classes with the minimum distance to the consensus classes are identified as the best. To evaluate the effect of feature selection on cluster analysis, the proposed scheme was applied to four expression datasets: COLON, LEUKEMIA72, LEUKEMIA38, and OVARIAN. Only 5, 1, 0, and 0 samples, respectively, were misclassified. We conclude that the proposed scheme, SamCluster, is an efficient method for discovering sample classes from gene expression profiles.
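
The iterative part of the scheme (CV filter, two-class clustering, t-test reselection, repeat until stable) can be illustrated with a minimal Python sketch. This is not the SamCluster program: the abstract does not state the distance metric, linkage rule, or t-test variant, so the sketch assumes Euclidean distance, average linkage, and Student's pooled-variance t-test, and it assumes the t-test is run over all genes rather than only the CV-filtered ones. The function names (`cv`, `two_clusters`, `t_test_p`, `discover_classes`) and the `max_iter` parameter are hypothetical, and the final consensus step across CV thresholds is omitted.

```python
import math
from statistics import mean, stdev, variance


def cv(values):
    """Coefficient of variation: sample standard deviation over |mean|."""
    m = mean(values)
    return stdev(values) / abs(m) if m != 0 else 0.0


def two_clusters(data, genes):
    """Agglomerative clustering of samples, stopped at two clusters.

    data[g][s] is the expression of gene g in sample s. Euclidean distance
    and average linkage are assumptions, not taken from the paper.
    """
    n = len(data[0])

    def dist(a, b):
        return math.sqrt(sum((data[g][a] - data[g][b]) ** 2 for g in genes))

    clusters = [[s] for s in range(n)]
    while len(clusters) > 2:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # clusters[0] always contains sample 0, so labels are comparable across runs.
    labels = [0] * n
    for s in clusters[1]:
        labels[s] = 1
    return labels


def t_test_p(x, y):
    """Two-sided p-value of Student's pooled-variance t-test (assumed variant)."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    if pooled == 0:
        return 1.0 if mean(x) == mean(y) else 0.0
    t = abs(mean(x) - mean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    # Cap t for the numerical integration; for df >= 2 the p-value at t = 100
    # is already far below any threshold used here.
    t = min(t, 100.0)
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    # Simpson's rule for the t density on [0, t].
    steps = 4000
    h = t / steps
    area = pdf(0) + pdf(t) + sum((4 if k % 2 else 2) * pdf(k * h) for k in range(1, steps))
    area *= h / 3
    return max(0.0, 1 - 2 * area)


def discover_classes(data, cv_threshold, p_threshold=0.05, max_iter=20):
    """Return a 0/1 class label per sample for one CV threshold."""
    genes = [g for g, row in enumerate(data) if cv(row) > cv_threshold]
    if not genes:
        raise ValueError("no gene exceeds the CV threshold")
    labels = two_clusters(data, genes)
    for _ in range(max_iter):
        a = [s for s, lab in enumerate(labels) if lab == 0]
        b = [s for s, lab in enumerate(labels) if lab == 1]
        if min(len(a), len(b)) < 2:
            break  # the t-test needs at least two samples per class
        selected = [g for g, row in enumerate(data)
                    if t_test_p([row[s] for s in a], [row[s] for s in b]) <= p_threshold]
        if not selected:
            break
        new_labels = two_clusters(data, selected)
        if new_labels == labels:
            break  # classes are stable
        labels = new_labels
    return labels


# Toy data (made up for illustration): rows are genes, columns are samples.
toy = [
    [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    [2.0, 2.1, 1.9, 8.0, 7.7, 8.2],
    [3.0, 3.1, 2.9, 3.0, 3.1, 2.9],  # nearly constant, removed by the CV filter
    [6.0, 5.8, 6.1, 1.0, 1.2, 0.9],
]
print(discover_classes(toy, cv_threshold=0.3))
```

The sketch is written with the standard library only so that it is self-contained; the quadratic-time clustering loop is adequate for a toy example but not for real expression datasets. Running `discover_classes` once per CV threshold would produce the set of putative classes from which the consensus classes described above are built.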

Availability: The related program SamCluster is available upon request or from the web page http://www.sph.uth.tmc.edu:8052/hgc/Downloads.asp.

Publication types

  • Comparative Study
  • Evaluation Study
  • Research Support, Non-U.S. Gov't
  • Validation Study

MeSH terms

  • Algorithms*
  • Animals
  • Cluster Analysis*
  • Colonic Neoplasms / classification
  • Colonic Neoplasms / genetics
  • Consensus Sequence
  • Female
  • Gene Expression Profiling / methods*
  • Humans
  • Leukemia / classification
  • Leukemia / genetics
  • Neoplasms / classification*
  • Neoplasms / genetics*
  • Oligonucleotide Array Sequence Analysis / methods*
  • Ovarian Neoplasms / classification
  • Ovarian Neoplasms / genetics
  • Pattern Recognition, Automated
  • Reproducibility of Results
  • Sensitivity and Specificity
  • Sequence Alignment / methods
  • Sequence Analysis, DNA / methods*
  • Software
  • Systems Integration