This file contains instructions for modeling the distribution of the gene
expression profile of a test sample as a mixture of distributions, with each
component characterizing the expression levels in a class, and for assigning
a class label to each test sample. The programs are written in C.

****************************************************
**************** INSTALLATION **********************
****************************************************

Unzip and untar DNC-MIX.tar.gz:

> gzip -d DNC-MIX.tar.gz
> tar xvf DNC-MIX.tar

This will produce a directory DNC-MIX/ in the current directory.
DNC-MIX/ will contain:
  - README file (this document),
  - src/ directory (source code),
  - example/ directory (sample input files for running the algorithms and
    checking that they were compiled correctly),
  - ranlib.c/ directory.

The ranlib.c/ directory contains the random number generator package,
ranlib.c, that some of our programs use. You do not need to do anything
with it directly. Ranlib (along with its documentation) was downloaded from
http://lib.stat.cmu.edu/general/Utexas/

****************************************************
****************** COMPILING ***********************
****************************************************

We distinguish the following cases:

CASE 1 - Analysis USING TRAINING data
******
In this case both a training dataset and a test dataset are available.
This case is useful for handling:

SUB-CASE 1.1 - CLASSIFICATION
************
-training samples exist and all the test samples belong to the known classes.
-the number of mixture components equals the number of classes in the
 training data.

SUB-CASE 1.2 - JOINT ANALYSIS OF CLASS DISCOVERY AND CLASSIFICATION
************
-training samples exist but some of the test samples do not belong to any
 of the known classes.
-the number of mixture components is larger than the number of classes in
 the training data.

CASE 2 - Analysis WITHOUT USING TRAINING data
******
In this case a test dataset is available, but there are no training samples.
This case is useful for handling:

- CLASS DISCOVERY
-there are no training samples; only a test dataset is available.
-the number of mixture components varies between 1 and a pre-determined
 integer value M_max, chosen to be sufficiently large for each specific
 problem.

The following are the sets of commands for compiling the programs used in
each of these cases. Make sure you are in the directory DNC-MIX/

*** CASE 1 / SUB-CASE 1.1 ***

> cc -o MIXT_train_start src/MIXT_train_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm
> cc -o MIXT_read_start src/MIXT_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

*** CASE 1 / SUB-CASE 1.2 ***

> cc -o MIXT_read_start src/MIXT_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm
> cc -o MIXT_multirandom_start src/MIXT_multirandom_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

*** CASE 2 ***

> cc -o MIXT_notrain_read_start src/MIXT_notrain_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm
> cc -o MIXT_notrain_multirandom_start src/MIXT_notrain_multirandom_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

The above commands produce the following executables in the current
directory DNC-MIX/:
MIXT_train_start, MIXT_read_start, MIXT_multirandom_start,
MIXT_notrain_read_start, MIXT_notrain_multirandom_start.

****************************************************
**************** NOTATIONS *************************
****************************************************

K   = # of known classes (K=0 if no training data is available)
M   = # of mixture components in the distribution of a test sample
      (M >= max{1, K})
U   = # of unknown classes (M = K+U)
N_k = # of training samples from the k-th class, k = 1, ..., K
N   = # of training samples (N = N_1 + ... + N_K)
T   = # of test samples
G   = # of genes/variables whose expression levels are measured in each
      data sample

****************************************************
****************************************************
****************************************************

Next we describe the input files, the commands for running the executables,
and the output files. The following instructions are common to both types of
analysis, with and without training data. The input and output files have
almost the same structure for both types of analysis; any differences are
mentioned in the appropriate section.

****************************************************
*********** INPUT FILES ***************************
****************************************************

Input files: train, test, start, header_train, header_notrain, true_class,
seed_in, G

Input files "train" and "test" contain the training and test samples,
respectively. Each line of these files represents a gene, and contains its
expression levels across the training or test samples, respectively.

Example: Suppose K=2, G=2, N_1=2, N_2=3, T=4, and that
  - the expression levels of gene 1 in the training samples from class 1 and
    class 2 are: 1.01, -1.32 and -0.93, 0.56, 1.67, respectively,
  - the expression levels of gene 2 in the training samples from class 1 and
    class 2 are: -1.21, -2.31 and 0.74, 2.78, 4.70, respectively,
  - the expression levels of gene 1 in the test samples are:
    1.92, 2.45, -0.78, 3.21,
  - the expression levels of gene 2 in the test samples are:
    2.82, 3.57, 1.77, 3.43.

Then, file "train" should be:
1.01 -1.32 -0.93 0.56 1.67
-1.21 -2.31 0.74 2.78 4.70

and file "test" should be:
1.92 2.45 -0.78 3.21
2.82 3.57 1.77 3.43

Input file "start" contains initial values for the weights of the mixture,
the class means and the class standard deviations. Each line of this file
represents a gene g, and contains, in the given order, the following:
  - the starting values for the M weights of the mixture w_m, m=1,2,...,M,
  - the starting values of the means mu_{gm} of gene g in classes m=1,2,...,M,
  - the starting values of the standard deviations sigma_{gm} of gene g in
    classes m=1,2,...,M.

Example: Suppose M=2, and the starting values are:
w_1 = w_2 = 0.5
mu_{11} = -0.65398, mu_{12} = 0.78583, sigma_{11} = 0.55307, sigma_{12} = 0.79989
mu_{21} = -0.69958, mu_{22} = 0.99953, sigma_{21} = 0.56101, sigma_{22} = 0.61332

Then, file "start" should be:
0.5 0.5 -0.65398 0.78583 0.55307 0.79989
0.5 0.5 -0.69958 0.99953 0.56101 0.61332

****************************************************

For the input file "header_train" enter the following information, in the
given order:

1-st line: G = # of genes, K = # of known classes, U = # of unknown classes,
  N_max = maximum # of training samples among the K known classes,
  T = # of test samples
2-nd line: N_1 = no. of training samples from known class 1, the N_1
  positions of the training samples from known class 1 among all training
  samples
...
(k+1)-th line: N_k = no.
  of training samples from known class k, the N_k positions of the training
  samples from known class k among all training samples
...
(K+1)-th line: N_K = no. of training samples from known class K, the N_K
  positions of the training samples from known class K among all training
  samples

Example: Suppose K=3, G=30, N_1=8, N_2=10, N_3=6, T=55, and we fit the model
with M=7 components (hence U=4).

Then, file "header_train" should be:
30 3 4 10 55
8 1 2 3 4 5 6 7 8
10 9 10 11 12 13 14 15 16 17 18
6 19 20 21 22 23 24

****************************************************

For the input file "header_notrain" enter the following information, in the
given order, in just one line:
G = # of genes, M = # of mixture components, T = # of test samples

Example: Suppose G=40, T=78, and we fit the model with M=5 components.

Then, file "header_notrain" should be:
40 5 78

****************************************************

Input file "true_class" contains the following:
first line: the true number of classes in the test data
remaining lines: each line contains the class labels of the test samples
  under one permutation of the labels.

Example: Suppose there are two classes present in the test samples, there
are 7 test samples, the first 2 are from class 1 and the next 5 from class 2,
and suppose we fit a model with M=3 classes (labeled 1,2,3).

Then, file "true_class" should be:
2
1 1 2 2 2 2 2
2 2 1 1 1 1 1
1 1 3 3 3 3 3
3 3 1 1 1 1 1
3 3 2 2 2 2 2
2 2 3 3 3 3 3

****************************************************

Input file "seed_in" contains a random long integer (less than 2 billion),
which is taken as the seed.

Example: Suppose we want to set the seed to 4127180.

Then, file "seed_in" should be:
4127180

****************************************************

Input file "G" should contain the number of genes used.

Example: Suppose that expression levels are measured for 25 genes.

Then, file "G" should be:
25

****************************************************
*********** RUNNING the PROGRAMS *******************
****************************************************

*** CASE 1 / SUB-CASE 1.1 ***

> MIXT_train_start train test start true_class OUT PRED G >& /dev/null
> MIXT_read_start header_train train test start true_class OUT PRED G >& /dev/null

*** CASE 1 / SUB-CASE 1.2 ***

> MIXT_read_start header_train train test start true_class OUT PRED G >& /dev/null
> MIXT_multirandom_start header_train seed_in train test start true_class OUT seed_out PRED G >& /dev/null

*** CASE 2 ***

> MIXT_notrain_read_start header_notrain test start true_class OUT PRED G >& /dev/null
> MIXT_notrain_multirandom_start header_notrain seed_in test start true_class OUT seed_out PRED G >& /dev/null

****************************************************
*********** OUTPUT FILES ***************************
****************************************************

The output files have the same structure for both types of analysis, and
each of them is appended to on every run of a program.

OUTPUT files: OUT, PRED, seed_out

****************************************************

Each line of the "OUT" file shows the following, in this order:
M = # of mixture components, AIC, BIC, Prediction Accuracy Rate,
Log(Likelihood), G = # of genes

Example: Suppose G = 45 genes, we fitted the model with M=3 components, and
we obtained AIC = 115.92, BIC = 132.26, Prediction Accuracy Rate = 0.89, and
log(likelihood) = -49.96.

Then, file "OUT" should be:
3 115.92 132.26 0.89 -49.96 45

****************************************************

Each line of the "PRED" file shows the following, in this order:
G = # of genes, the class labels of the test samples numbered 1,2,...,T.

Example: Suppose G = 45 genes, there are 9 test samples, and we predicted the
following class assignments: the first two test samples are from class 3, the
next four from class 1, and the last three belong to classes 2, 2, 3.

Then, file "PRED" should be:
45 3 3 1 1 1 1 2 2 3

****************************************************

File "seed_out" contains a random long integer.

Example of "seed_out":
12718301

****************************************************
****************************************************
****************************************************
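
The layout of the "start" file described under INPUT FILES (one line per
gene: M weights, then M means, then M standard deviations) can be checked
before running any of the programs. The following is only a minimal sketch
and is not part of DNC-MIX: the function names check_start_row and
check_start_file, the error codes, and the tolerance WEIGHT_TOL are all
hypothetical. It assumes that the weights on each line should lie in [0,1]
and sum to 1, and that every standard deviation should be positive.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed tolerance for "weights sum to 1"; not taken from DNC-MIX. */
#define WEIGHT_TOL 1e-6

/* Hypothetical helper: checks one gene's row of a "start" file, laid out
 * as M weights, M means, M standard deviations (3*M numbers).
 * Returns 0 if the row looks plausible, otherwise:
 *   1 = a weight lies outside [0,1]
 *   2 = the weights do not sum to 1 (within WEIGHT_TOL)
 *   3 = a standard deviation is not positive */
int check_start_row(const double *row, int M)
{
    double wsum = 0.0;
    int m;

    assert(row != NULL && M >= 1);
    for (m = 0; m < M; m++) {
        if (row[m] < 0.0 || row[m] > 1.0)
            return 1;
        wsum += row[m];
    }
    if (wsum < 1.0 - WEIGHT_TOL || wsum > 1.0 + WEIGHT_TOL)
        return 2;
    for (m = 0; m < M; m++) {
        if (row[2 * M + m] <= 0.0)
            return 3;
    }
    return 0;
}

/* Hypothetical helper: reads G rows of 3*M numbers from fp and checks each.
 * Numbers are read in sequence, so line breaks themselves are not verified.
 * Returns 0 if all G rows are present and plausible, the 1-based index of
 * the first missing or implausible row, G+1 if extra numbers follow the
 * last row, or -1 if memory could not be allocated. */
int check_start_file(FILE *fp, int G, int M)
{
    double *row, extra;
    int g, j, bad = 0;

    assert(fp != NULL && G >= 1 && M >= 1);
    row = malloc((size_t)(3 * M) * sizeof *row);
    if (row == NULL)
        return -1;
    for (g = 1; g <= G && bad == 0; g++) {
        for (j = 0; j < 3 * M; j++) {
            if (fscanf(fp, "%lf", &row[j]) != 1) {
                bad = g;            /* row g is missing or truncated */
                break;
            }
        }
        if (bad == 0 && check_start_row(row, M) != 0)
            bad = g;                /* row g has implausible values */
    }
    if (bad == 0 && fscanf(fp, "%lf", &extra) == 1)
        bad = G + 1;                /* more numbers than G rows of 3*M */
    free(row);
    return bad;
}
```

For the two-gene "start" example given under INPUT FILES (M=2), opening the
file with fopen and calling check_start_file(fp, 2, 2) would be expected to
return 0. A nonzero return value points to the first gene row worth
inspecting by hand.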