This file contains instructions for installing and running programs that
model the distribution of the gene expression profile of a test sample
as a mixture of distributions, with each component characterizing the
expression levels in a class, and that assign a class label to each
test sample. The programs are written in C.


****************************************************
****************INSTALLATION ***********************
****************************************************

Uncompress and untar DNC-MIX.tar.gz:

> gzip -dc DNC-MIX.tar.gz | tar xvf -


This will produce a directory DNC-MIX/ in the current directory.
DNC-MIX/ will contain: the README file (this document), the src/ directory (source code),
the example/ directory (sample input files for running the programs and
checking that they were compiled correctly), and the ranlib.c/ directory.
The ranlib.c/ directory contains the random number generator package, ranlib.c,
that some of our programs use. You do not need to do anything with it directly.

Ranlib (along with its documentation) was downloaded from

       http://lib.stat.cmu.edu/general/Utexas/  


****************************************************
****************** COMPILING ***********************
****************************************************


We distinguish the following cases:

CASE 1 - Analysis USING TRAINING data
******
	 In this case a training dataset as well as a test dataset are available.
	 This case is useful for handling:

	 SUB-CASE 1.1 - CLASSIFICATION
	 ************     
		      -training samples exist and all the test samples 
		       belong to the known classes.
		      -the number of mixture components equals the 
	               number of classes in the training data.	
	         
         SUB-CASE 1.2 - JOINT ANALYSIS OF CLASS DISCOVERY AND CLASSIFICATION
         ************		   
		      -training samples exist but some of the test samples 
		       do not belong to any of the known classes. 
		      -the number of mixture components is larger than the
		       number of classes in the training data.

CASE 2 - Analysis WITHOUT USING TRAINING data
******
         In this case a test dataset is available, but there are no training samples.
         This case is useful for handling: 

	 CLASS DISCOVERY
	 ***************
		      -there are no training samples, only a test dataset
		       is available.
		      -the number of mixture components varies between
		       1 and a pre-determined integer value M_max, chosen 
		       to be sufficiently large for each specific problem.


The following are the sets of commands for compiling the programs used in 
each of these cases:

Make sure you are in the directory DNC-MIX/


	  *** CASE 1 / SUB-CASE 1.1 ***

> cc -o MIXT_train_start src/MIXT_train_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

> cc -o MIXT_read_start  src/MIXT_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm


	  *** CASE 1 / SUB-CASE 1.2 ***

> cc -o MIXT_read_start  src/MIXT_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

> cc -o MIXT_multirandom_start src/MIXT_multirandom_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm


	  *** CASE 2 ***

> cc -o MIXT_notrain_read_start  src/MIXT_notrain_read_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

> cc -o MIXT_notrain_multirandom_start src/MIXT_notrain_multirandom_start.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm


The above commands produce in the current directory DNC-MIX/  
the following executables: 
MIXT_train_start, MIXT_read_start, MIXT_multirandom_start,
MIXT_notrain_read_start, MIXT_notrain_multirandom_start.


****************************************************
**************** NOTATIONS ************************* 
****************************************************

K = # of known classes (K=0 if no training data is available)
M = # of mixture components in the distribution of a test sample (M >= max{1, K})
U = # of unknown classes (M = K+U)
N_k = # of training samples from the k-th class, k = 1, ..., K
N = # of training samples (N = N_1+ ... + N_K)
T = # of test samples 
G = # of genes/variables whose expression levels are measured in each data sample


****************************************************
****************************************************
****************************************************

Next we describe the input files, the commands for running the executables,
and the output files. 

The following instructions are common to both types of analysis, with and
without training data.

The input and output files have almost the same structure for both
types of analysis; any differences are noted in the appropriate section.


****************************************************
*********** INPUT FILES ***************************
****************************************************

Input files: train, test, start, header_train, header_notrain,
	      true_class, seed_in, G


Input files "train" and "test" contain the training and test samples, respectively. 
Each line of these files represents a gene and contains its expression levels
across the training or test samples, respectively.

Example: Suppose K=2, G=2, N_1=2, N_2=3, T=4, 
the gene expression levels of gene 1 in training samples from class 1
and class 2 are: 1.01, -1.32 and -0.93, 0.56, 1.67, respectively,
the gene expression levels of gene 2 in training samples from class 1
and class 2 are: -1.21, -2.31 and 0.74, 2.78, 4.70, respectively,
the gene expression levels of gene 1 in test samples are: 1.92, 2.45, -0.78, 3.21, and
the gene expression levels of gene 2 in test samples are: 2.82, 3.57, 1.77, 3.43

Then,

"train" file should be:
1.01 -1.32 -0.93 0.56 1.67 
-1.21 -2.31 0.74 2.78 4.70

"test" file should be:
1.92 2.45 -0.78 3.21
2.82 3.57 1.77 3.43
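
A quick way to catch layout mistakes in "train" and "test" before running
the programs is to count the numbers on each line. The following is only a
sketch, not part of the DNC-MIX distribution; the function names count_fields
and check_matrix are hypothetical, and it assumes that values are separated
by spaces or tabs and that no line is longer than the buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Count the numbers on one line of text.
   Returns -1 if the line contains anything that is not a number. */
int count_fields(const char *line)
{
    const char *p = line;
    char *end;
    int n = 0;

    for (;;) {
        (void)strtod(p, &end);  /* skips leading whitespace, then parses */
        if (end == p)           /* nothing parsed: end of line or bad token */
            break;
        n++;
        p = end;
    }
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return (*p == '\0') ? n : -1;
}

/* Check that every non-blank line of fp holds exactly `cols` numbers.
   Returns the number of such lines (genes), or -1 at the first mismatch.
   Lines longer than the buffer are not handled by this sketch. */
int check_matrix(FILE *fp, int cols)
{
    char line[65536];
    int rows = 0, n;

    while (fgets(line, sizeof line, fp) != NULL) {
        n = count_fields(line);
        if (n == 0)
            continue;           /* ignore blank lines */
        if (n != cols)
            return -1;
        rows++;
    }
    return rows;
}
```

For the example above, check_matrix would be called with cols = 5 for "train"
(N = N_1 + N_2) and cols = 4 for "test" (T); in both cases the returned number
of rows should equal G = 2.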


Input file "start" contains initial values for the weights of the mixture, 
the class means and the class standard deviations.
Each line of this file represents a gene g, and contains, in the given order, the following:
the starting values for the M weights of the mixture w_m, m=1,2,...,M, 
the starting values of the means mu_{gm} of gene g in classes m=1,2,...,M and 
the starting values of the standard deviations sigma_{gm} of gene g in classes m=1,2,...,M.

Example: 
Suppose M=2, and the starting values are: 
w_1 = w_2 = 0.5
mu_{11} = -0.65398, mu_{12} = 0.78583, sigma_{11} =  0.55307, sigma_{12} = 0.79989
mu_{21} = -0.69958, mu_{22} = 0.99953, sigma_{21} =  0.56101, sigma_{22} = 0.61332

Then, file "start" should be:

0.5 0.5 -0.65398 0.78583 0.55307 0.79989
0.5 0.5 -0.69958 0.99953 0.56101 0.61332
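
The layout of a "start" line (M weights, then M means, then M standard
deviations) can be checked with a few lines of C. This is a sketch only; the
function name check_start_line and its return codes are hypothetical. It
assumes, as is usual for mixture models but not stated explicitly in this
README, that the weights are nonnegative and sum to 1 and that the standard
deviations are positive.

```c
#include <stdlib.h>

/* Check one line of "start" for a model with M components.
   Returns 0 if the line is acceptable, otherwise:
     1 = the line does not hold exactly 3*M numbers,
     2 = a weight is negative or the weights do not sum to 1,
     3 = a standard deviation is not positive. */
int check_start_line(const char *line, int M)
{
    const char *p = line;
    char *end;
    double v, wsum = 0.0, diff;
    int i;

    for (i = 0; i < 3 * M; i++) {
        v = strtod(p, &end);
        if (end == p)
            return 1;                   /* fewer than 3*M numbers */
        if (i < M) {                    /* fields 1..M: weights */
            if (v < 0.0)
                return 2;
            wsum += v;
        } else if (i >= 2 * M) {        /* fields 2M+1..3M: std. deviations */
            if (v <= 0.0)
                return 3;
        }                               /* fields M+1..2M: means, any value */
        p = end;
    }
    (void)strtod(p, &end);
    if (end != p)
        return 1;                       /* more than 3*M numbers */

    diff = wsum - 1.0;
    if (diff < 0.0)
        diff = -diff;
    return (diff > 1e-6) ? 2 : 0;
}
```

Both lines of the example "start" file above pass this check with M = 2.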

****************************************************

For the input file "header_train" enter the following information, 
in the given order:
1-st line: G = # of genes, K = # known classes, U = # unknown classes, 
	   N_max = maximum # of training samples among the K known classes,
	   T = # of test samples
2-nd line:
     N_1 = no. of training samples from known class 1,
     the N_1 positions of the training samples from known class 1 
     among all training samples
...
(k+1)-th line:
	 N_k = no. of training samples from known class k,
	 the N_k positions of the training samples from known class k 
	 among all training samples
...
(K+1)-th line:
	  N_K = no. of training samples from known class K,
	  the N_K positions of the training samples from known class K 
	  among all training samples
 
Example: Suppose K=3, G=30, N_1=8, N_2=10, N_3=6, T=55, and we fit the model
with M=7 components (hence U=4).

Then, "header_train" file will be:

30 3 4 10 55
8 1 2 3 4 5 6 7 8  
10 9 10 11 12 13 14 15 16 17 18 
6 19 20 21 22 23 24 
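
When the training samples are stored class by class (all class 1 columns of
"train" first, then class 2, and so on, as in the example above), the
"header_train" file can be generated rather than typed. The sketch below
makes that ordering assumption; the function name write_header_train and its
argument order are hypothetical and not part of the DNC-MIX programs.

```c
#include <stdio.h>

/* Write a header_train file to `out`.
   G = number of genes, K = number of known classes, M = number of mixture
   components, n[k] = number of training samples in class k+1, T = number of
   test samples. Assumes the columns of "train" are ordered by class.
   Returns 0 on success, -1 if M < K. */
int write_header_train(FILE *out, int G, int K, int M, const int *n, int T)
{
    int k, i, pos = 1, n_max = 0;

    if (M < K)
        return -1;
    for (k = 0; k < K; k++)
        if (n[k] > n_max)
            n_max = n[k];

    /* 1-st line: G K U N_max T, with U = M - K */
    fprintf(out, "%d %d %d %d %d\n", G, K, M - K, n_max, T);

    /* (k+1)-th line: N_k followed by the positions of class k samples */
    for (k = 0; k < K; k++) {
        fprintf(out, "%d", n[k]);
        for (i = 0; i < n[k]; i++)
            fprintf(out, " %d", pos++);
        fputc('\n', out);
    }
    return 0;
}
```

Called with G=30, K=3, M=7, n = {8, 10, 6} and T=55, this writes the same four
lines as the example above (without trailing spaces).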

****************************************************

For the input file "header_notrain" enter the following information,
in the given order, in just one line:
G = # of genes, M = # of mixture components, T = # of test samples

Example: Suppose G=40, T=78, and we fit the model with M=5 components.

Then, file "header_notrain" should be:

40 5 78

****************************************************

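Input file "true_class" contains the following:
first line: the true number of classes in the test data;
each remaining line: the class labels of the test samples under one
assignment of the fitted labels 1,...,M to the true classes, with one
line for every such assignment.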

Example: Suppose there are two classes present in the test samples,
there are 7 test samples, the first 2 are from class 1, the next 5 from class 2,
and suppose we fit a model with M=3 components (labeled 1,2,3).

Then, file "true_class" should be:

2
1 1 2 2 2 2 2 
2 2 1 1 1 1 1 
1 1 3 3 3 3 3 
3 3 1 1 1 1 1 
3 3 2 2 2 2 2 
2 2 3 3 3 3 3 
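
Writing out every relabeling by hand becomes tedious as M grows, so the lines
of "true_class" can be generated. The sketch below enumerates every way of
giving the true classes distinct labels from 1,...,M, which reproduces the six
lines of the example above, though in a different order. Whether the programs
depend on the order of these lines is not stated in this README, so treat the
ordering as an assumption. The names assign_labels, write_true_class and
MAX_LABELS are hypothetical.

```c
#include <stdio.h>

#define MAX_LABELS 16   /* arbitrary limit for this sketch */

/* Give true classes c, c+1, ..., C distinct labels from 1..M and print one
   line of T labels for every complete assignment.
   Returns the number of lines written. */
int assign_labels(FILE *out, const int *truth, int T, int C, int M,
                  int c, int *map, int *used)
{
    int m, t, lines = 0;

    if (c > C) {                        /* all classes labeled: print line */
        for (t = 0; t < T; t++) {
            if (t > 0)
                fputc(' ', out);
            fprintf(out, "%d", map[truth[t]]);
        }
        fputc('\n', out);
        return 1;
    }
    for (m = 1; m <= M; m++) {
        if (used[m])
            continue;
        used[m] = 1;
        map[c] = m;
        lines += assign_labels(out, truth, T, C, M, c + 1, map, used);
        used[m] = 0;
    }
    return lines;
}

/* truth[t] is the true class (1..C) of test sample t+1.
   Writes C on the first line, then one line per assignment.
   Returns the number of label lines, or -1 on invalid arguments. */
int write_true_class(FILE *out, const int *truth, int T, int C, int M)
{
    int map[MAX_LABELS + 1] = {0}, used[MAX_LABELS + 1] = {0};

    if (C < 1 || C > M || M > MAX_LABELS)
        return -1;
    fprintf(out, "%d\n", C);
    return assign_labels(out, truth, T, C, M, 1, map, used);
}
```

With truth = {1, 1, 2, 2, 2, 2, 2}, T=7, C=2 and M=3, the function writes the
first line "2" followed by M!/(M-C)! = 6 label lines.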

****************************************************

Input file "seed_in" contains a random long integer (less than 2
billion), which is used as the seed.

Example: Suppose we want to set the seed to be equal to 4127180.

Then, file "seed_in" should be :

4127180

****************************************************

Input file "G" should contain the number of genes used.

Example: Suppose that expression levels are measured for 25 genes.

Then, file "G" should be:

25

****************************************************
*********** RUNNING the PROGRAMS *******************
****************************************************


	    *** CASE 1 / SUB-CASE 1.1 ***
	    
> MIXT_train_start train test start true_class OUT PRED G >& /dev/null

> MIXT_read_start header_train train test start true_class OUT PRED G >& /dev/null


	  *** CASE 1 / SUB-CASE 1.2 ***

> MIXT_read_start header_train train test start true_class OUT PRED G >& /dev/null

> MIXT_multirandom_start header_train seed_in train test start true_class OUT seed_out PRED G >& /dev/null


	  *** CASE 2 ***

> MIXT_notrain_read_start header_notrain test start true_class OUT PRED G >& /dev/null

> MIXT_notrain_multirandom_start header_notrain seed_in test start true_class OUT seed_out PRED G >& /dev/null


****************************************************
*********** OUTPUT FILES ***************************
****************************************************

The output files have the same structure for both types of analysis,
and each of them is appended to on every run of a program.

OUTPUT files: OUT, PRED, seed_out

****************************************************

Each line of the "OUT" file shows the following, in this order:
M=#of mixture components, AIC, BIC, Prediction Accuracy Rate, Log(Likelihood), G=# genes

Example: Suppose G = 45 genes, we fitted the model with M=3 components, 
and we obtained AIC = 115.92, BIC = 132.26, Prediction Accuracy Rate = 0.89, and
log(likelihood) = -49.96.

Then, file "OUT" should be:

3  115.92  132.26  0.89  -49.96 45
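
Because "OUT" accumulates one line per run, it is convenient to read the
lines back and compare the fits obtained for different values of M. The
sketch below parses a line in the field order given above and picks the row
with the smallest BIC. It assumes the usual convention in which a smaller BIC
indicates the preferred model; this README does not state which convention
the programs use, so check that before relying on it. The names out_row,
parse_out_line and best_by_bic are hypothetical.

```c
#include <stdio.h>

/* One line of the OUT file, in the order given above. */
struct out_row {
    int M;              /* number of mixture components */
    double aic, bic;    /* information criteria */
    double accuracy;    /* prediction accuracy rate */
    double loglik;      /* log(likelihood) */
    int G;              /* number of genes */
};

/* Parse one OUT line; returns 1 on success, 0 if the line is malformed. */
int parse_out_line(const char *line, struct out_row *r)
{
    return sscanf(line, "%d %lf %lf %lf %lf %d",
                  &r->M, &r->aic, &r->bic,
                  &r->accuracy, &r->loglik, &r->G) == 6;
}

/* Index of the row with the smallest BIC, or -1 if there are no rows.
   ASSUMPTION: smaller BIC is better. */
int best_by_bic(const struct out_row *rows, int n)
{
    int i, best = (n > 0) ? 0 : -1;

    for (i = 1; i < n; i++)
        if (rows[i].bic < rows[best].bic)
            best = i;
    return best;
}
```

A caller would read "OUT" line by line with fgets, pass each line to
parse_out_line, and then call best_by_bic on the collected rows.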

****************************************************

Each line of the "PRED" file shows the following, in this order:
G=# genes, the class labels of test samples numbered 1,2,...,T.

Example: Suppose G = 45 genes, there are 9 test samples, and we
predicted the following class assignments: the first two test samples are
from class 3, the next 4 from class 1, and the last three belong to classes 2,2,3.

Then, file "PRED" should be:

45  3 3 1 1 1 1 2 2 3 
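
A line of "PRED" can be summarized by counting how many test samples were
assigned to each class. The following sketch reads G from the first field and
tallies the remaining labels; the function name tally_pred_line is
hypothetical, and the caller is assumed to know the largest label that can
occur (M).

```c
#include <stdlib.h>

/* Tally the class labels on one PRED line.
   On return *G holds the first field and counts[m] the number of test
   samples labeled m, for m = 1..max_label (counts needs max_label + 1
   entries). Returns the number of labels read (T), or -1 if the line is
   malformed or a label lies outside 1..max_label. */
int tally_pred_line(const char *line, int *G, int *counts, int max_label)
{
    const char *p = line;
    char *end;
    long v;
    int T = 0, m;

    for (m = 0; m <= max_label; m++)
        counts[m] = 0;

    v = strtol(p, &end, 10);            /* first field: number of genes */
    if (end == p)
        return -1;
    *G = (int)v;
    p = end;

    for (;;) {                          /* remaining fields: class labels */
        v = strtol(p, &end, 10);
        if (end == p)
            break;
        if (v < 1 || v > max_label)
            return -1;
        counts[v]++;
        T++;
        p = end;
    }
    while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
        p++;
    return (*p == '\0') ? T : -1;
}
```

For the example line above with max_label = 3, this gives G = 45, T = 9, and
counts of 4, 2 and 3 for classes 1, 2 and 3.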


****************************************************

File "seed_out" contains a random long integer.

Example of "seed_out":

12718301

****************************************************
****************************************************
****************************************************