FSR: feature set reduction for scalable and accurate multi-class cancer subtype classification based on copy number

Gerard Wong; Christopher Leckie; Adam Kowalczyk

doi:10.1093/bioinformatics/btr644

Bioinformatics. 2012 Jan 15;28(2):151-9. doi: 10.1093/bioinformatics/btr644. Epub 2011 Nov 21.

Authors

Gerard Wong¹, Christopher Leckie, Adam Kowalczyk

Affiliation

¹ National ICT Australia, Victoria Research Laboratory, Parkville, Australia. gwong@csse.unimelb.edu.au

PMID: 22110244
DOI: 10.1093/bioinformatics/btr644

Abstract

Motivation: Feature selection is a key concept in machine learning for microarray datasets, where the number of features, represented by probesets, is typically several orders of magnitude larger than the number of available samples. Computational tractability is a key challenge for feature selection algorithms when handling very high-dimensional datasets with more than a hundred thousand features, such as those produced on single nucleotide polymorphism microarrays. In this article, we present a novel feature set reduction approach that enables scalable feature selection on datasets with hundreds of thousands of features and beyond. Our approach handles higher resolution datasets more efficiently, improving disease subtype classification of samples for potentially more accurate diagnosis and prognosis, which allows clinicians to make more informed decisions with regard to patient treatment options.

Results: We applied our feature set reduction approach to several publicly available cancer single nucleotide polymorphism (SNP) array datasets and evaluated its performance in terms of its multiclass predictive classification accuracy over different cancer subtypes, its speedup in execution, and its scalability with respect to sample size and array resolution. Feature Set Reduction (FSR) was able to reduce the dimensions of an SNP array dataset by more than two orders of magnitude while achieving at least equal, and in most cases superior, predictive classification performance compared with that achieved on features selected by existing feature selection methods alone. An examination of the biological relevance of frequently selected features from FSR-reduced feature sets revealed strong enrichment in association with cancer.
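The abstract does not specify how FSR reduces the feature set, so the following is only a minimal sketch of the general two-stage pattern it describes: shrink the feature set first, then run a feature selector on the reduced set. Both stages here are assumptions made for illustration and are not the authors' algorithm. Stage 1 merges runs of adjacent, highly correlated probes into one averaged feature, and stage 2 ranks the reduced features by a simple between-class variance score. All function names, the `threshold` parameter, and the toy data are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def reduce_features(columns, threshold=0.9):
    """Stage 1 (assumed, not FSR itself): merge runs of adjacent correlated features.

    columns: non-empty list of feature columns ordered by genomic position,
    each a list of per-sample values. Returns one averaged column per run.
    """
    groups = [[columns[0]]]
    for prev, col in zip(columns, columns[1:]):
        if pearson(prev, col) >= threshold:
            groups[-1].append(col)   # extend the current run
        else:
            groups.append([col])     # start a new run
    return [[mean(vals) for vals in zip(*group)] for group in groups]


def class_separation(column, labels):
    """Stage 2 score (stand-in selector): size-weighted between-class variance."""
    overall = mean(column)
    score = 0.0
    for cls in set(labels):
        vals = [v for v, lab in zip(column, labels) if lab == cls]
        score += len(vals) * (mean(vals) - overall) ** 2
    return score


def select_top(columns, labels, k):
    """Return indices of the k features with the highest class separation."""
    order = sorted(range(len(columns)),
                   key=lambda i: class_separation(columns[i], labels),
                   reverse=True)
    return order[:k]


# Toy data: 6 samples from 3 subtypes, 6 probes in genomic order.
labels = ["A", "A", "B", "B", "C", "C"]
probes = [
    [2.0, 2.1, 2.0, 1.9, 2.0, 2.1],  # uninformative
    [3.0, 3.1, 2.0, 2.1, 1.0, 1.1],  # three adjacent probes sharing one
    [3.1, 3.0, 2.1, 2.0, 1.1, 1.0],  # subtype-specific copy number pattern
    [2.9, 3.2, 1.9, 2.0, 0.9, 1.2],
    [2.0, 1.9, 2.1, 2.0, 2.0, 1.9],  # uninformative
    [2.1, 2.0, 2.0, 2.1, 1.9, 2.0],  # uninformative
]

reduced = reduce_features(probes)
top = select_top(reduced, labels, k=1)
print(len(probes), "features reduced to", len(reduced))
print("top reduced feature index:", top)
```

The point of the pattern is that the selector in stage 2 only ever sees the reduced set, so its cost scales with the number of merged regions rather than the number of raw probes.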

Availability: FSR was implemented in MATLAB R2010b and is available at http://ww2.cs.mu.oz.au/~gwong/FSR.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Artificial Intelligence
DNA Copy Number Variations*
Humans
Neoplasms / classification
Neoplasms / diagnosis
Neoplasms / genetics*