Title: Cancer Class Discovery and Classification using Microarray Gene Expression Data

Abstract: Class discovery and classification are of crucial importance for determining efficient therapies in cancer treatment. Recent studies have shown that DNA microarray technology is a powerful tool for the discovery and classification of cancer types and subtypes. We propose a unified approach to discovering putative classes of cancers and classifying the interrogated tumor samples into the identified classes. Our approach handles datasets with no training samples (class discovery), datasets where training samples exist and all test samples belong to the known classes (classification), and datasets where training data exist but some of the test samples do not belong to any of the known classes (joint class discovery and classification). The proposed method models the distribution of a gene expression profile as a finite mixture of an unknown number of distributions, with each mixture component characterizing the gene expression levels within a class. We use the expectation-maximization (EM) algorithm to find the maximum likelihood estimates of the mixture model parameters. The number of classes is selected by the Bayesian (BIC) or Akaike (AIC) information criterion. We reduce the high dimensionality of the data by using several measures for gene selection, and we explore the sensitivity of the class discovery and class prediction results to the gene selection measure and the number of selected genes. We applied our procedure to several datasets derived from the leukemia dataset of Golub et al. (1999) and to several simulation studies based on these datasets. For most leukemia datasets with 80 to 150 genes, the true number of classes is identified and the prediction accuracy rates are over 96% when there is at most one unknown class. Even in the most difficult case, when there are three unknown classes and no training data, BIC discovers the true number of classes, and the prediction accuracy rates are 89% or more for most datasets with 50 to 200 genes. The simulations further showed that our method is able to discover the true number of classes and to achieve good prediction accuracy.
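
The model-fitting and model-selection steps described above (a finite mixture fitted by EM, with the number of components chosen by BIC or AIC) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes Gaussian components and a single one-dimensional feature for brevity, whereas the abstract does not state the component family and the real data are high-dimensional gene expression profiles. The function names `fit_mixture` and `select_num_classes`, the quantile-based initialization, the variance floor, and the synthetic data are all hypothetical choices made for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian (assumed component family)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def e_step(data, weights, means, variances):
    """Return posterior class probabilities per sample and the log-likelihood."""
    resp, loglik = [], 0.0
    for x in data:
        logs = [math.log(w) + log_normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)  # log-sum-exp for numerical stability
        log_total = top + math.log(sum(math.exp(v - top) for v in logs))
        loglik += log_total
        resp.append([math.exp(v - log_total) for v in logs])
    return resp, loglik


def fit_mixture(data, k, n_iter=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative sketch)."""
    n = len(data)
    ordered = sorted(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    var_floor = 1e-3 * var_all  # keeps a component from collapsing onto one point
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [var_all] * k
    weights = [1.0 / k] * k
    prev = -math.inf
    for _ in range(n_iter):
        resp, loglik = e_step(data, weights, means, variances)
        if loglik - prev < tol:
            break
        prev = loglik
        # M-step: weighted maximum likelihood updates for each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)
            weights[j] = nj / n
    _, loglik = e_step(data, weights, means, variances)
    return weights, means, variances, loglik


def select_num_classes(data, max_k):
    """Score k = 1..max_k by BIC and AIC; return the BIC-optimal k and all scores."""
    n = len(data)
    scores = {}
    for k in range(1, max_k + 1):
        loglik = fit_mixture(data, k)[3]
        n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
        scores[k] = {
            "bic": -2.0 * loglik + n_params * math.log(n),
            "aic": -2.0 * loglik + 2.0 * n_params,
        }
    best_k = min(scores, key=lambda k: scores[k]["bic"])
    return best_k, scores


# Synthetic stand-in for one expression feature measured in two classes.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(6.0, 1.0) for _ in range(100)]
best_k, scores = select_num_classes(data, max_k=4)
print("number of classes selected by BIC:", best_k)
```

In this sketch, a sample would be assigned to the component with the largest posterior probability returned by `e_step`. Extending the idea to many genes would require multivariate component densities, which is where the gene selection step mentioned in the abstract becomes important.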