This file contains instructions for running EM and SEM algorithms
to get estimates of linkage parameters when heterogeneity is
suspected. The SEs of the EM and SEM estimates can also be obtained,
if desired. These programs are written in C.

********* INSTALLATION *************

untar and unzip MAXPROC.tar.Z

> zcat MAXPROC.tar.Z | tar xvf -

This will produce a directory MAXPROC/ in the current
directory. MAXPROC/ will contain: README file (this document), src/
directory (source sodes), example/ directory (containing sample input files to
run the algorithms and ensure that they are compiled correctly), and
ranlib.c/ directory. The last one contains the random number generator package,
ranlib.c that some of our programs use. You do not need to do anything
with it directly. 

Ranlib (alongwith its documentation) was downloaded from

       http://lib.stat.cmu.edu/general/Utexas/  

******** COMPILING ******************

Following are two sets of commands for compiling programs for running
EM and SEM algorithms, the first set for two-point analysis and the
second set for multipoint analysis.

Make sure you are in the directory MAXPROC/

         *** TWO-POINT ANALYSIS ***

> cc -o em_twopoint src/em_twopoint.c src/functions_twopoint.c src/se_em_twopoint.c -lm


> cc -o sem_twopoint src/sem_twopoint.c src/functions_twopoint.c src/summary_failfile.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

         *** MULTI-POINT ANALYSIS ***

> cc -o em_multipoint src/em_multipoint.c src/functions_multipoint.c src/se_em_multipoint.c -lm

> cc -o sem_multipoint src/sem_multipoint.c src/functions_multipoint.c src/summary_failfile.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm

The above commands make two executable versions of the EM and SEM programs,
em_twopoint, sem_twopoint, em_multipoint, and sem_multipoint, in the
current directory MAXPROC/.

SE of EM estimates is given by the program for EM itself, if desired. 

To compile program for finding SE of SEM, use 

cc -o se_sem src/se_sem.c src/functions_multipoint.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm 
 
This creates an executable file, se_sem, which can be used for twopoint
as well as multipoint analysis.

The following instructions are common for both two point and
multipoint analyses. The input and output files have almost same
structure for both types of analyses - any differences are mentioned
in the appropriate section. 

********** NOTATIONS **************** 

Linkage parameter = theta (represents genetic distance for multipoint
analysis). 
#groups in the sample = k
#families in i-th group = n_i, i = 1, ..., k.
Proportion of linked families in the i-th group = alpha_i, 
i = 1, ..., k. 

We first describe the input files, commands for running EM and SEM
algorithms and their output files. We go to the instructions for
getting SE of SEM at the end. 

*********** INPUT FILES for running EM and SEM ****************

EM and SEM need two and three input files, respectively. The two input
files are common to both and they are described first. 
One of these input files (second one) needs lod scores of all families at a 
user-specified grid of theta or distance (denoted as "d" in the rest of
the instructions) values, e.g., theta = 0, 0.1, 0.2, 0.3, 0.4, 0.5. These
can be obtained using the software LINKAGE or GENEHUNTER. Following is a 
description of how to prepare the input files. All fields should be
separated by a white space or newline ("enter"). 

Input file #1: infile

Enter the following information IN the given ORDER.

(1) # groups in the sample (k).
(2) # families in each group (n_1, n_2, ..., n_k).
(3) Total # of theta (or distance) values in the grid.
(4) Starting values (initial guess) of alpha_1, alpha_2, ...,
    alpha_k. E.g., alpha_i = 0.5 for all i (impartial starting values).
(5) Starting value (initial guess) of theta (d). E.g., theta = 0.1, or
    distance = middle value in the grid of distances. 
    Note: Starting value of theta (or distance) should be one of the
    values in the grid. 
(6) Stopping criteria of the algorithm (a very small number):
    iterations in the algorithms are stopped when the relative difference
    between the consecutive estimates is less than this small
    number. E.g., 0.000001.  
(7) If SE of the estimates is needed, enter 1, otherwise enter 0. 

Example: Suppose the sample consists of k = 2 groups of families. First
group has n_1 = 100 families and second group has n_2 = 200 families. 
Suppose there are a total of 6 theta values over which we want to
maximize the likelihood. We want the starting values of the algorithms to be 
alpha_1 = 0.5, alpha_2 = 0.5 and theta = 0.1, and would like to stop 
iterations if the relative difference < 0.000001. Further, suppose we 
want SE of the estimates. Then the infile should be  

2
100
200
6
0.5
0.5
0.1
0.000001
1

Input file #2: lodfile

First enter all the theta (d) values in the grid to be used for
maximizing the likelihood with respect to theta (d). 

Note: For two point analysis, this grid should be a regular
grid, i.e., the difference between any two consecutive theta values
should be same. Otherwise the program will give an error message
"ERROR: theta grid is not regular".

Next enter the lod scores of all families at all the theta (d) values in
the grid. These are entered in form of a matrix. Rows correspond
to families and columns to the theta (d) values. 
For each family, enter the lod scores over the grid in a row. So total
# of rows in this file should be equal to the total # of families in
the sample + 1  (for the theta or distance values). Make sure that all
the families in group#1 are listed first, then the families in group#2
and so on. Do not mix the families in different groups. 

Example: Suppose there are 2 groups, 1st group has 3 families and 2nd
group has 2 families. Let the grid of theta values be from 0 to
0.5 at an increment of 0.1. Let us denote the lod score of family#j of 
group#i at a given theta value as LODij(theta). Then the lodfile 
should look like the following:

   0	      0.1        0.2         0.3          0.4          0.5
LOD11(0)  LOD11(0.1)  LOD11(0.2)  LOD11(0.3)   LOD11(0.4)   LOD11(0.5)	
LOD12(0)  LOD12(0.1)  LOD12(0.2)  LOD12(0.3)   LOD12(0.4)   LOD12(0.5)
LOD13(0)  LOD13(0.1)  LOD13(0.2)  LOD13(0.3)   LOD13(0.4)   LOD13(0.5)
LOD21(0)  LOD21(0.1)  LOD21(0.2)  LOD21(0.3)   LOD21(0.4)   LOD21(0.5)  
LOD22(0)  LOD22(0.1)  LOD22(0.2)  LOD22(0.3)   LOD22(0.4)   LOD22(0.5)

Notes: 
(1) LODij(theta=0.5) should be equal to 0 for all i and j.
(2) For multipoint analysis, theta=0.5 corresponds to d=infinity, but 
    DO NOT enter d=infinity and its corresponding lod scores. 
    It is taken care of inherently in the multipoint programs. 
    However, for two point analysis, you should enter theta=0.5 and its
    corresponding lod scores (as shown above).

As noted earlier, LINKAGE or GENEHUNTER can provide the lod
scores. You may need to write a small program which would take the lod
scores from these programs as input and write them in the above format
in the lodfile after listing the theta (d) values.

Input file#3: seed_file (needed only for running SEM)

Enter two random long integers (each less than 2 billion), which are
taken as seeds by the SEM.

Example:

1057170440
1057170449


************* RUNNING EM and SEM ***************

Once the input files are ready, you can run the executable files
em_twopoint/em_multipoint and sem_twopoint/sem_multipoint. You will
need to specify the name of output file where the output will be
stored.  Suppose the name of this file is "outputfile". If you
already have file with the same name in your current directory, the
new output would be appended to it.  In addition to this main output
file, SEM generates an additional output file called "failfile". 
 
To run EM, use either 

> em_twopoint infile lodfile outputfile 

or 

> em_multipoint infile lodfile outputfile 

depending on what kind of data you have inputted.

Similarly, to run SEM, use either

> sem_twopoint infile lodfile seed_file outputfile 

or 

> sem_multipoint infile lodfile seed_file outputfile 

IMPORTANT NOTES: 

(1) If SE of EM estimates is asked for, three intermediate files are
produced with the names estimates_file, lod_se_file and zij_file. Make
sure you don't have files with these names in the directory you're
running this program, otherwise they would be OVERWRITTEN!! 

(2) As mentioned above, SEM generates an extra output file called
"failfile". Also, if SE of SEM estimates is asked for, a file named
"infile_se" is produced. Make sure you don't have file with these
names in the directory you're running this program, otherwise they
would be OVERWRITTEN!!

************ OUTPUT FILES from running EM and SEM ***********************

The "outputfile" of the two programs will give the estimates of
alpha_i, i = 1, ..., k and theta (d). In addition, for multipoint
analysis only, the SEM will also show "number of infinity distances"
in the final 100 iterations. The multipoint SEM estimates are obtained by
averaging non-infinity distances in the last 100 iterations, so by
knowing the number of non-infinity distances, you know the average is
calculated over how many numbers. 

If SEM fails (a total of 2000 restarts is allowed), then no estimates
are reported, instead, the outputfile will say "SEM failed" and is
followed by a summary of violation of boundary conditions; this may be
helpful in identifying groups with all linked or all unlinked
families. For each group, i.e., each alpha_i, three numbers are reported. Of
these three numbers, the first number tells how many times a boundary
condition corresponding to that alpha_i is violated, second number
tells how many times the condition "alpha_i not very close to 0" is
violated and the third number tells how many times the condition
"alpha_i not very close to 1" is violated. So, the sum of the second
and third numbers is the first number.    

If SE of EM is asked for, the outputfile of the EM will also
contain the information matrix, which has to be inverted to get the
variance-covariance matrix and the SE of the EM estimates. The first diagonal
element always corresponds to the linkage paramter (theta or d) and
the other diagonal elements are for the alpha_i's, in order in which
the groups were written in the lodfile.

If SE of SEM is desired, the outputfile will say at the end "You
requested for SE of the estimates: Please make lod_se_sem and then
run se_sem". The instructions for making these files are given later.

The extra output file (with name failfile) of SEM will contain a record of
failures of SEM, if any, because of violation of the boundary
conditions. This file is not directly of use to you, since a
summary of this record is written into the main outputfile if SEM
fails. Failfile simply lists the current estimates of all alpha_i's when any
boundary condition is violated. When SEM fails, this file is
summarized and reported in the main outputfile. If SEM does not fail,
the last line in this file says "closing fail file".


******************* SE OF SEM ESTIMATES *******************

Once the SEM estimates are obtained, note the value of theta_hat
(estimate of theta/d). Most likely, it will not be one of the
values in the grid of theta (d) values used for running
SEM. Do one of the two following options (if estimate is one of the grid
values, go to the option 2):

(1) Choose two points, one less than theta_hat (say, theta_minus) and
    another greater than theta_hat (say, theta_plus), but both very
    close to theta_hat, and calculate lod scores of all families at
    theta_minus, theta_hat, and theta_plus. 

(2) Round off the value of theta_hat to the nearest theta value in the
    grid. Then, choose as theta_minus and theta_plus the two nearest neighbours
    of the theta_hat in the grid. Then you can use the same lod
    scores at theta_minus, theta_hat, and theta_plus, that you used
    for SEM. NOTE: This option should be used only if the grid of
    theta values is very fine so that there is not much loss in
    precision due to rounding.

Now create a new file, say "lod_se_sem". First enter the values 
theta_minus, theta_hat, and theta_plus, in a row. Then below it, at
each row enter the lod scores of each family (families listed in the
same order as in the lodfile for running SEM) at theta_minus, theta_hat, and
theta_plus (in this order only). So, like the lodfile, this file will
also look like a matrix where the rows correspond to the families and
the columns correspond to the theta values. There will be a total of 
(# of families in the sample + 1) rows and 3 columns.  
 
You also need a file, say "seed_file_se", in which you need to enter
two integers in the same way as you did in seed_file before you ran
SEM. 

Now you are ready to obtain SE of SEM. Just check that you have the
file "infile_se" (created during SEM execution) in the current directory

To find SE, run the command: 

se_sem lod_se_sem seed_file_se outputfile 

The last file is where the information matrix of the SEM estimates
would be written. This needs to be inverted to obtain the
variance-covariance matrix and the SE of the EM estimates, as in the
case of SE of EM estimates. 

Note: The SE of SEM is calculated based on 2000 bootstrap
      samples. We expect it to give accurate estimates of SE. It it
      does not or if you want more precise estimates, you can open the
      source code se_sem.c in the src/ directory and in the third line
      where it says "#define NUM_IT 2000", change 2000 to whatever
      number you prefer and then re-compile the program.  


******************** EXAMPLE *******************************

An example of running the programs with a simulated dataset is
given in the sub-directory example/. Please look at the file
README_example in this sub-directory which gives instructions for 
executing the example.