This file contains instructions for modeling the distribution of the
gene expression profile of a test sample as a mixture of
distributions, with each component characterizing the expression
levels in a class, and assigning a class label to each test sample.
These programs are written in C.


****************************************************
****************INSTALLATION ***********************
****************************************************

Uncompress and untar DNC-MIX.tar.gz:

> gzip -dc DNC-MIX.tar.gz | tar xvf -


This will produce a directory DNC-MIX/ in the current directory. 
DNC-MIX/ will contain: the README file (this document), the src/ directory (source code), 
the example/ directory (sample input files for running the programs
and checking that they were compiled correctly), and the
ranlib.c/ directory. The ranlib.c/ directory contains the random number generator 
package, ranlib.c, that some of our programs use. You do not need to do anything
with it directly. 

Ranlib (along with its documentation) was downloaded from

       http://lib.stat.cmu.edu/general/Utexas/  



****************************************************
****************** COMPILING ***********************
****************************************************


We distinguish the following cases:

CASE 1 - Analysis USING TRAINING data
     In this case a training dataset as well as a test dataset are available.
     This case is useful for handling:
     SUB-CASE 1.1 - CLASSIFICATION problems:
                    -training samples exist and all the test samples 
                     belong to the known classes.
                    -the number of mixture components equals the 
                     number of classes in the training data.
     SUB-CASE 1.2 - JOINT ANALYSIS OF CLASS DISCOVERY AND CLASSIFICATION
                    -training samples exist but some of the test samples 
                     do not belong to any of the known classes. 
                    -the number of mixture components is larger than the
                     number of classes in the training data.

CASE 2 - Analysis WITHOUT USING TRAINING data
     This case is useful for handling: 
                 - CLASS DISCOVERY problems
                     -there are no training samples, only a test dataset
                      is available.
                     -the number of mixture components varies between
                      1 and a pre-determined integer value M_max, chosen to be
                      sufficiently large for each specific problem.


The following are the sets of commands for compiling the programs used in 
each of these cases:

Make sure you are in the directory DNC-MIX/


	  *** CASE 1 / SUB-CASE 1.1 ***

> cc -o MIXT_train_start src/MIXT_train_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

> cc -o MIXT_read_start  src/MIXT_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm


	  *** CASE 1 / SUB-CASE 1.2 ***

> cc -o MIXT_read_start  src/MIXT_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

> cc -o MIXT_multirandom_start src/MIXT_multirandom_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm


	  *** CASE 2 ***

> cc -o MIXT_notrain_read_start  src/MIXT_notrain_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

> cc -o MIXT_notrain_multirandom_start src/MIXT_notrain_multirandom_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm


The above commands produce, in the current directory DNC-MIX/,
the executables: MIXT_train_start, MIXT_read_start, MIXT_multirandom_start,
MIXT_notrain_read_start, MIXT_notrain_multirandom_start.





****************************************************
**************** NOTATIONS ************************* 
****************************************************

K = # of known classes (K=0 if no training data is available.)
M = # of mixture components in the distribution of a test sample (M >= max{1, K})
U = # of unknown classes ( M = K+U.)
N_k = # of training samples from the k-th class, k = 1, ..., K.
N = # of training samples (N = N_1+ ... + N_K.)
T = # of test samples 
G = # of genes/variables whose expression levels are measured in each data sample



****************************************************

Next we describe the input files, the commands for running the programs,
and the output files. 

The following instructions are common to both types of analysis, with and
without training data.
The input and output files have almost the same structure for both 
types of analysis; any differences are mentioned in the appropriate section. 



****************************************************
*********** INPUT FILES ***************************
****************************************************

Input files : train, test, start, header_train, header_notrain, 
	      true_class, seed_in, G


Input files "train" and "test" contain the
training and test samples respectively. Each line of these files
represents a gene, and contains its expression levels among the
training or test samples, respectively.

Example: suppose 
K=2, G=2, N_1=2, N_2=3, T=4, and
the expression levels of gene 1 in the training samples from class 1
and class 2 are 1.01, -1.32 and -0.93, 0.56, 1.67, respectively;
the expression levels of gene 2 in the training samples from class 1
and class 2 are -1.21, -2.31 and 0.74, 2.78, 4.70, respectively;
the expression levels of gene 1 in the test samples are 1.92, 2.45, -0.78, 3.21;
the expression levels of gene 2 in the test samples are 2.82, 3.57, 1.77, 3.43.

Then "train" file should be:
1.01 -1.32 -0.93 0.56 1.67 
-1.21 -2.31 0.74 2.78 4.70

"test" file should be:
1.92 2.45 -0.78 3.21
2.82 3.57  1.77 3.43
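
The layout above (one gene per line, one sample per column) can be
checked before running the programs. The following is a minimal sketch
in C, not part of DNC-MIX: parse_expression() is a hypothetical helper
that parses text laid out like "train" or "test" and rejects lines
with too few or too many values. How the DNC-MIX programs themselves
read these files is not described here.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of DNC-MIX.
 * Parse text with G lines of S whitespace-separated numbers each
 * (one gene per line, one sample per column). Stores the value of
 * gene g in sample s at x[g * S + s]. Returns 0 on success, -1 if a
 * line has a missing, extra, or non-numeric value. */
int parse_expression(const char *text, int G, int S, double *x)
{
    const char *p = text;

    for (int g = 0; g < G; g++) {
        for (int s = 0; s < S; s++) {
            char *end;

            /* Skip blanks, but stop at the end of the line. */
            while (*p == ' ' || *p == '\t')
                p++;
            if (*p == '\n' || *p == '\r' || *p == '\0')
                return -1;              /* line is too short */

            x[g * S + s] = strtod(p, &end);
            if (end == p)
                return -1;              /* not a number */
            p = end;
        }

        /* Only trailing blanks may remain on this line. */
        while (*p == ' ' || *p == '\t' || *p == '\r')
            p++;
        if (*p == '\n')
            p++;
        else if (*p != '\0')
            return -1;                  /* line is too long */
    }
    return 0;
}
```

For the "train" example above one would call the helper with G=2 and
S=5 (S = N for "train", S = T for "test").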


Input file "start" contains initial values for the weights of the
mixture, the class means and the class standard deviations.
Each line of this file represents a gene g, and contains, in the given
order, the following: 
the starting values for the M weights of the mixture w_m, m=1,2,...,M, 
the starting values of the means mu_{gm} of gene g in classes m=1,2,...,M and next, 
the starting values of the standard deviations sigma_{gm} of gene g in classes m=1,2,...,M.

Example: 
suppose M=2, and the starting values are: 
w_1 = w_2 = 0.5
mu_{11} = -0.65398, mu_{12} = 0.78583, sigma_{11} =  0.55307, sigma_{12} = 0.79989
mu_{21} = -0.69958, mu_{22} = 0.99953, sigma_{21} =  0.56101, sigma_{22} = 0.61332
Then "start" file should look like:
0.5 0.5 -0.65398 0.78583 0.55307 0.79989
0.5 0.5 -0.69958 0.99953 0.56101 0.61332
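
To make the ordering within a "start" line concrete, here is a minimal
sketch in C that splits one line into its three blocks of M values
(weights, means, standard deviations). Both functions are hypothetical
helpers, not part of DNC-MIX. The plausibility check (weights sum to 1,
standard deviations are positive) follows from the usual definition of
a mixture and is an assumption, not a requirement stated in this README.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of DNC-MIX.
 * Split one line of a "start" file into w[0..M-1], mu[0..M-1] and
 * sigma[0..M-1], in that order. Returns 0 on success, -1 if the line
 * holds fewer than 3*M numbers. */
int parse_start_line(const char *line, int M,
                     double *w, double *mu, double *sigma)
{
    double *block[3] = { w, mu, sigma };
    const char *p = line;

    for (int b = 0; b < 3; b++) {
        for (int m = 0; m < M; m++) {
            char *end;
            block[b][m] = strtod(p, &end);
            if (end == p)
                return -1;      /* ran out of numbers */
            p = end;
        }
    }
    return 0;
}

/* Assumed sanity check: weights sum to 1 (within 1e-6) and every
 * standard deviation is strictly positive. Returns 1 if both hold. */
int start_values_plausible(const double *w, const double *sigma, int M)
{
    double sum = 0.0;

    for (int m = 0; m < M; m++) {
        if (sigma[m] <= 0.0)
            return 0;
        sum += w[m];
    }
    return sum > 1.0 - 1e-6 && sum < 1.0 + 1e-6;
}
```

With M=2 and the first example line, w receives 0.5 0.5, mu receives
the two means, and sigma receives the two standard deviations.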



For the input file "header_train" enter the following information, in
the given order:
1-st line: G = # of genes, K = # known classes, U = # unknown classes, 
	   N_max = maximum # of training samples among the K known classes,
	   T = # of test samples
2-nd line:
     N_1 = no. of training samples from known class 1,
     the N_1 positions of the training samples from known class 1 
     among all training samples
...
(k+1)-th line:
	 N_k = no. of training samples from known class k,
	 the N_k positions of the training samples from known class k 
	 among all training samples
...
(K+1)-th line:
	  N_K = no. of training samples from known class K,
	  the N_K positions of the training samples from known class K 
	  among all training samples
 
Example: 
suppose K=3, G=30, N_1=8, N_2=10, N_3=6, T=34, and we fit the model
with M=7 components (hence U=4.)

Then, "header_train" file will be:
30 3 4 10 34
8 1 2 3 4 5 6 7 8  
10 9 10 11 12 13 14 15 16 17 18 
6 19 20 21 22 23 24 
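
The fields of "header_train" are tied together by the definitions
above: N_max is the largest of N_1, ..., N_K, and the position lists
refer to columns of the "train" file. The sketch below, in C, checks
these relations before a run. Both functions are hypothetical helpers,
not part of DNC-MIX, and the requirement that every training column
1..N appears in exactly one class is an assumption.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of DNC-MIX.
 * Returns the largest class size among n_k[0..K-1], i.e. the value
 * expected in the N_max field (0 if K is 0). */
int max_class_size(const int *n_k, int K)
{
    int best = 0;

    for (int k = 0; k < K; k++)
        if (n_k[k] > best)
            best = n_k[k];
    return best;
}

/* Hypothetical helper, not part of DNC-MIX.
 * pos holds the K position lists one after another (N entries in
 * total, N = n_k[0] + ... + n_k[K-1]). Returns 1 if every training
 * column 1..N is listed exactly once (an assumed requirement),
 * 0 otherwise. */
int positions_partition(const int *n_k, int K, const int *pos)
{
    int N = 0;
    int ok = 1;

    for (int k = 0; k < K; k++)
        N += n_k[k];

    char *seen = calloc((size_t)N + 1, 1);
    if (seen == NULL)
        return 0;

    for (int i = 0; i < N; i++) {
        int p = pos[i];
        if (p < 1 || p > N || seen[p]) {
            ok = 0;             /* out of range or listed twice */
            break;
        }
        seen[p] = 1;
    }
    free(seen);
    return ok;
}
```

For N_1=8, N_2=10, N_3=6, max_class_size() gives 10 and the 24
positions 1, ..., 24 pass the partition check.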



For the input file "header_notrain" enter the following information,
in the given order, in just one line:
G = # of genes, M = # of mixture components, T = # of test samples

Example: 
suppose G=40, T=78, and we fit the model with M=5 components.
Then, "header_notrain" file will be:
40 5 78



Input file "true_class" contains the following:
first line: the true number of classes in the test data;
each remaining line: the true class labels of the T test samples, written
under one possible assignment of the M fitted labels to the true classes.
There is one such line for every possible assignment.

Example: suppose 
there are two classes present in the test samples,
there are 7 test samples, the first 2 from class 1 and the next 5 from class 2,
and suppose we fit a model with M=3 classes (labeled 1,2,3).

Then the "true_class" file should be:
2
1 1 2 2 2 2 2 
2 2 1 1 1 1 1 
1 1 3 3 3 3 3 
3 3 1 1 1 1 1 
3 3 2 2 2 2 2 
2 2 3 3 3 3 3 
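
In the example above, the six label lines are the six ways of giving
the two true classes distinct labels out of 1,2,3. Under that reading,
the lines can be generated rather than typed by hand. The following C
sketch does so; relabelings() and assign() are hypothetical helpers,
not part of DNC-MIX, and the sketch emits the lines in its own order,
which differs from the order shown above. Whether the programs depend
on the order of the lines is not stated here.

```c
#define MAX_LABELS 16   /* illustrative limit on M */

/* Recursively choose a distinct label in 1..M for true classes
 * c, c+1, ..., C-1. Each complete choice yields one relabeled copy of
 * truth[], stored in row `count` of out if there is room. */
static int assign(int c, int C, int M, int *map, int *used,
                  const int *truth, int T,
                  int *out, int max_lines, int count)
{
    if (c == C) {
        if (count < max_lines)
            for (int t = 0; t < T; t++)
                out[count * T + t] = map[truth[t] - 1];
        return count + 1;
    }
    for (int label = 1; label <= M; label++) {
        if (used[label])
            continue;
        used[label] = 1;
        map[c] = label;
        count = assign(c + 1, C, M, map, used,
                       truth, T, out, max_lines, count);
        used[label] = 0;
    }
    return count;
}

/* Hypothetical helper, not part of DNC-MIX.
 * truth[t] in 1..C is the true class of test sample t. Writes up to
 * max_lines relabeled lines into out (row-major, T values per line)
 * and returns the total number of lines, or -1 on bad arguments. */
int relabelings(const int *truth, int T, int C, int M,
                int *out, int max_lines)
{
    int map[MAX_LABELS];
    int used[MAX_LABELS + 1] = { 0 };

    if (C < 1 || C > M || M > MAX_LABELS)
        return -1;
    return assign(0, C, M, map, used, truth, T, out, max_lines, 0);
}
```

The returned count grows quickly with M (it is M!/(M-C)! for C true
classes), so the buffer should be sized accordingly.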



Input file "seed_in" contains a random long integer (less than 2
billion), which is used as the seed of the random number generator.

Example: suppose we want to set the seed to 4127180.
Then file "seed_in" could look like:
4127180


Input file "G" should contain the number of genes used.

Example: suppose that expression levels are measured for 25 genes.
Then, file "G" looks like:
25

****************************************************
*********** RUNNING the PROGRAMS *******************
****************************************************


	    *** CASE 1 / SUB-CASE 1.1 ***
	    
> MIXT_train_start train test start true_class OUT PRED G >& /dev/null

> MIXT_read_start header_train train test start true_class OUT PRED G >& /dev/null


	  *** CASE 1 / SUB-CASE 1.2 ***

> MIXT_read_start header_train train test start true_class OUT PRED G >& /dev/null

> MIXT_multirandom_start header_train seed_in train test start true_class OUT seed_out PRED G >& /dev/null


	  *** CASE 2 ***

> MIXT_notrain_read_start header_notrain test start true_class OUT PRED G >& /dev/null

> MIXT_notrain_multirandom_start header_notrain seed_in test start true_class OUT seed_out PRED G >& /dev/null

****************************************************


****************************************************
*********** OUTPUT FILES ***************************
****************************************************

The output files have the same structure for both types of analysis.
Every run of a program appends one new line to each output file.

OUTPUT files: OUT, PRED, seed_out

Each line of the OUT file shows in this order:
M=#of mixture components, AIC, BIC, Prediction Accuracy Rate, Log(Likelihood), G=# genes

Example: suppose
G = 45 genes, we fitted the model with M=3 components, and we obtained
AIC = 115.92,  BIC = 132.26, Prediction Accuracy Rate = 0.89,
log(likelihood) = -49.96.

Then "OUT" should look like:
3  115.92  132.26  0.89    -49.96 45
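
Because every run appends one line to OUT, the file accumulates one
record per fitted model. The sketch below, in C, reads one such line
into a struct and picks the record with the smallest BIC. The struct
and both functions are hypothetical, not part of DNC-MIX. Choosing M by
smallest BIC is an assumption: this README lists AIC and BIC but does
not say which criterion, if any, should be used for model selection.

```c
#include <stdio.h>

/* One line of the OUT file, in the order given above. */
struct out_record {
    int    M;         /* number of mixture components */
    double aic;
    double bic;
    double accuracy;  /* prediction accuracy rate */
    double loglik;    /* log(likelihood) */
    int    G;         /* number of genes */
};

/* Hypothetical helper, not part of DNC-MIX.
 * Parse one OUT line. Returns 0 on success, -1 if the line does not
 * contain all six fields. */
int parse_out_line(const char *line, struct out_record *r)
{
    int n = sscanf(line, "%d %lf %lf %lf %lf %d",
                   &r->M, &r->aic, &r->bic,
                   &r->accuracy, &r->loglik, &r->G);
    return n == 6 ? 0 : -1;
}

/* Hypothetical helper, not part of DNC-MIX.
 * Return the index of the record with the smallest BIC among
 * r[0..n-1], or -1 if n is 0. "Smaller is better" is assumed. */
int best_by_bic(const struct out_record *r, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++)
        if (best < 0 || r[i].bic < r[best].bic)
            best = i;
    return best;
}
```

Applied to the example line above, parse_out_line() fills M=3,
AIC=115.92, BIC=132.26, accuracy 0.89, log-likelihood -49.96 and G=45.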



Each line of the PRED file shows in this order:
G=# genes, the class labels of test samples numbered 1,2,...,T.

Example: suppose G = 45 genes, there are 9 test samples, and we
predicted the following class assignments: the first two test samples are
from class 3, the next 4 from class 1, and the last three belong to classes 2,1,3.

Then "PRED" should look like:
45  3 3 1 1 1 1 2 1 3 


"seed_out" contains a random long integer.

Example of "seed_out":
12718301

****************************************************
****************************************************
****************************************************

