Title:
Cancer Class Discovery and Classification using
Microarray Gene Expression Data
Abstract:
Class discovery and classification are crucial for
determining effective therapies in cancer treatment.
Recent studies have shown that DNA microarray technology is a
powerful tool for discovery and classification of cancer types
and subtypes.
We propose a unified approach to the discovery of putative
cancer classes and the classification of the interrogated
tumor samples into the identified classes.
Our approach is able to handle
datasets with no training samples (class discovery problems),
datasets where training samples exist and all the test samples belong to
the known classes (classification), as well as datasets where training
data exist but some of the test samples do not belong to any of the
known classes (joint analysis of class discovery and classification).
The proposed method models the distribution of a
gene expression profile as a finite mixture of an unknown number of
distributions, with each mixture component characterizing the gene
expression levels within a class.
We use the expectation-maximization (EM) algorithm to find the maximum
likelihood estimates of the mixture model parameters.
The number of classes is selected using the Bayesian (BIC)
or Akaike (AIC) information criterion.
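As a rough illustration of the mixture-model idea, the sketch below fits
one-dimensional Gaussian mixtures by EM and picks the number of components
with the lowest BIC. It is a simplified toy under stated assumptions, not
the procedure used in the study: the univariate Gaussian components, the
quantile-based initialization, the variance floor, and all function names
(fit_mixture_em, information_criteria, select_number_of_classes) are
hypothetical choices made for this example.

```python
import math
import random


def component_densities(v, weights, means, variances):
    """Weighted Gaussian density of each mixture component at observation v."""
    return [
        w * math.exp(-0.5 * (v - m) ** 2 / s) / math.sqrt(2.0 * math.pi * s)
        for w, m, s in zip(weights, means, variances)
    ]


def log_likelihood(x, weights, means, variances):
    """Log-likelihood of the data under the mixture (guarded against log(0))."""
    return sum(
        math.log(max(sum(component_densities(v, weights, means, variances)), 1e-300))
        for v in x
    )


def fit_mixture_em(x, k, n_iter=150, var_floor=0.05):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Assumptions of this sketch: means start at evenly spaced sample
    quantiles, and variances are floored to avoid degenerate components.
    """
    n = len(x)
    xs = sorted(x)
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(x) / n
    pooled_var = sum((v - grand_mean) ** 2 for v in x) / n
    variances = [max(pooled_var, var_floor)] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for v in x:
            dens = component_densities(v, weights, means, variances)
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from the posteriors.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nj
            variances[j] = max(
                sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nj,
                var_floor,
            )
    return weights, means, variances, log_likelihood(x, weights, means, variances)


def information_criteria(loglik, k, n):
    """Return (BIC, AIC); a 1-D k-component mixture has 3k - 1 free parameters."""
    n_params = 3 * k - 1
    bic = -2.0 * loglik + n_params * math.log(n)
    aic = -2.0 * loglik + 2.0 * n_params
    return bic, aic


def select_number_of_classes(x, k_max=3):
    """Fit mixtures with 1..k_max components and return the BIC-minimizing k."""
    bic_by_k = {}
    for k in range(1, k_max + 1):
        *_, loglik = fit_mixture_em(x, k)
        bic_by_k[k] = information_criteria(loglik, k, len(x))[0]
    return min(bic_by_k, key=bic_by_k.get), bic_by_k


# Toy data: two well-separated synthetic "classes" (not real expression data).
random.seed(0)
toy = [random.gauss(0.0, 1.0) for _ in range(200)]
toy += [random.gauss(8.0, 1.0) for _ in range(200)]
best_k, bic_by_k = select_number_of_classes(toy)
print(best_k, bic_by_k)
```

Real expression profiles are multivariate, so a practical version would
use multivariate components; the univariate case is kept here only to
make the E-step, M-step, and penalty terms easy to read.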
We reduce the high dimensionality of the data using
several gene selection measures, and explore the
sensitivity of the class discovery and class prediction results
to the gene selection measures and the number of selected genes.
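The gene selection step can be pictured with a small sketch. The abstract
does not name the selection measures, so the example below assumes one
simple possibility, ranking genes by their variance across samples and
keeping the top-ranked ones. The function names (select_top_genes,
reduce_profiles) and the toy matrix are hypothetical.

```python
from statistics import pvariance


def select_top_genes(expr, n_genes):
    """Return indices of the n_genes genes with the largest variance.

    expr is a list of samples; each sample is a list with one expression
    value per gene. Variance is an assumed selection measure for this sketch.
    """
    n_total = len(expr[0])
    scores = [pvariance([sample[g] for sample in expr]) for g in range(n_total)]
    ranked = sorted(range(n_total), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:n_genes])


def reduce_profiles(expr, gene_idx):
    """Keep only the selected genes in every sample's expression profile."""
    return [[sample[g] for g in gene_idx] for sample in expr]


# Toy matrix: 4 samples by 3 genes (made-up numbers, not real data).
toy_expr = [
    [1.0, 10.0, 5.0],
    [1.0, 20.0, 6.0],
    [1.0, 30.0, 5.0],
    [1.0, 40.0, 6.0],
]
kept = select_top_genes(toy_expr, 2)
reduced = reduce_profiles(toy_expr, kept)
print(kept, reduced)
```

Repeating the downstream analysis for several values of n_genes, or for a
different scoring function in place of the variance, is one way to probe
the sensitivity described above.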
We applied our procedure to several datasets derived from
the leukemia dataset of Golub et al. (1999)
and to several simulation studies based on these datasets.
For most leukemia datasets with 80 to 150 genes, the true number of classes
is identified and the prediction accuracy rates are over 96% when
there is at most one unknown class.
Even in the most difficult case, when there are three unknown
classes and no training data, BIC discovers the true number of classes
and the prediction accuracy rates are 89% or more for most
datasets with 50 to 200 genes.
The simulations further showed that our method is able to discover the
true number of classes and to achieve good prediction accuracy.