This file contains instructions for running EM and SEM algorithms to get estimates of linkage parameters when heterogeneity is suspected. The SEs of the EM and SEM estimates can also be obtained, if desired. These programs are written in C. ********* INSTALLATION ************* untar and unzip MAXPROC.tar.Z > zcat MAXPROC.tar.Z | tar xvf - This will produce a directory MAXPROC/ in the current directory. MAXPROC/ will contain: README file (this document), src/ directory (source sodes), example/ directory (containing sample input files to run the algorithms and ensure that they are compiled correctly), and ranlib.c/ directory. The last one contains the random number generator package, ranlib.c that some of our programs use. You do not need to do anything with it directly. Ranlib (alongwith its documentation) was downloaded from http://lib.stat.cmu.edu/general/Utexas/ ******** COMPILING ****************** Following are two sets of commands for compiling programs for running EM and SEM algorithms, the first set for two-point analysis and the second set for multipoint analysis. Make sure you are in the directory MAXPROC/ *** TWO-POINT ANALYSIS *** > cc -o em_twopoint src/em_twopoint.c src/functions_twopoint.c src/se_em_twopoint.c -lm > cc -o sem_twopoint src/sem_twopoint.c src/functions_twopoint.c src/summary_failfile.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm *** MULTI-POINT ANALYSIS *** > cc -o em_multipoint src/em_multipoint.c src/functions_multipoint.c src/se_em_multipoint.c -lm > cc -o sem_multipoint src/sem_multipoint.c src/functions_multipoint.c src/summary_failfile.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm The above commands make two executable versions of the EM and SEM programs, em_twopoint, sem_twopoint, em_multipoint, and sem_multipoint, in the current directory MAXPROC/. SE of EM estimates is given by the program for EM itself, if desired. To compile program for finding SE of SEM, use cc -o se_sem src/se_sem.c src/functions_multipoint.c ranlib.c/src/ranlib.c ranlib.c/src/com.c ranlib.c/linpack/linpack.c -lm This creates an executable file, se_sem, which can be used for twopoint as well as multipoint analysis. The following instructions are common for both two point and multipoint analyses. The input and output files have almost same structure for both types of analyses - any differences are mentioned in the appropriate section. ********** NOTATIONS **************** Linkage parameter = theta (represents genetic distance for multipoint analysis). #groups in the sample = k #families in i-th group = n_i, i = 1, ..., k. Proportion of linked families in the i-th group = alpha_i, i = 1, ..., k. We first describe the input files, commands for running EM and SEM algorithms and their output files. We go to the instructions for getting SE of SEM at the end. *********** INPUT FILES for running EM and SEM **************** EM and SEM need two and three input files, respectively. The two input files are common to both and they are described first. One of these input files (second one) needs lod scores of all families at a user-specified grid of theta or distance (denoted as "d" in the rest of the instructions) values, e.g., theta = 0, 0.1, 0.2, 0.3, 0.4, 0.5. These can be obtained using the software LINKAGE or GENEHUNTER. Following is a description of how to prepare the input files. All fields should be separated by a white space or newline ("enter"). Input file #1: infile Enter the following information IN the given ORDER. (1) # groups in the sample (k). (2) # families in each group (n_1, n_2, ..., n_k). (3) Total # of theta (or distance) values in the grid. (4) Starting values (initial guess) of alpha_1, alpha_2, ..., alpha_k. E.g., alpha_i = 0.5 for all i (impartial starting values). (5) Starting value (initial guess) of theta (d). E.g., theta = 0.1, or distance = middle value in the grid of distances. Note: Starting value of theta (or distance) should be one of the values in the grid. (6) Stopping criteria of the algorithm (a very small number): iterations in the algorithms are stopped when the relative difference between the consecutive estimates is less than this small number. E.g., 0.000001. (7) If SE of the estimates is needed, enter 1, otherwise enter 0. Example: Suppose the sample consists of k = 2 groups of families. First group has n_1 = 100 families and second group has n_2 = 200 families. Suppose there are a total of 6 theta values over which we want to maximize the likelihood. We want the starting values of the algorithms to be alpha_1 = 0.5, alpha_2 = 0.5 and theta = 0.1, and would like to stop iterations if the relative difference < 0.000001. Further, suppose we want SE of the estimates. Then the infile should be 2 100 200 6 0.5 0.5 0.1 0.000001 1 Input file #2: lodfile First enter all the theta (d) values in the grid to be used for maximizing the likelihood with respect to theta (d). Note: For two point analysis, this grid should be a regular grid, i.e., the difference between any two consecutive theta values should be same. Otherwise the program will give an error message "ERROR: theta grid is not regular". Next enter the lod scores of all families at all the theta (d) values in the grid. These are entered in form of a matrix. Rows correspond to families and columns to the theta (d) values. For each family, enter the lod scores over the grid in a row. So total # of rows in this file should be equal to the total # of families in the sample + 1 (for the theta or distance values). Make sure that all the families in group#1 are listed first, then the families in group#2 and so on. Do not mix the families in different groups. Example: Suppose there are 2 groups, 1st group has 3 families and 2nd group has 2 families. Let the grid of theta values be from 0 to 0.5 at an increment of 0.1. Let us denote the lod score of family#j of group#i at a given theta value as LODij(theta). Then the lodfile should look like the following: 0 0.1 0.2 0.3 0.4 0.5 LOD11(0) LOD11(0.1) LOD11(0.2) LOD11(0.3) LOD11(0.4) LOD11(0.5) LOD12(0) LOD12(0.1) LOD12(0.2) LOD12(0.3) LOD12(0.4) LOD12(0.5) LOD13(0) LOD13(0.1) LOD13(0.2) LOD13(0.3) LOD13(0.4) LOD13(0.5) LOD21(0) LOD21(0.1) LOD21(0.2) LOD21(0.3) LOD21(0.4) LOD21(0.5) LOD22(0) LOD22(0.1) LOD22(0.2) LOD22(0.3) LOD22(0.4) LOD22(0.5) Notes: (1) LODij(theta=0.5) should be equal to 0 for all i and j. (2) For multipoint analysis, theta=0.5 corresponds to d=infinity, but DO NOT enter d=infinity and its corresponding lod scores. It is taken care of inherently in the multipoint programs. However, for two point analysis, you should enter theta=0.5 and its corresponding lod scores (as shown above). As noted earlier, LINKAGE or GENEHUNTER can provide the lod scores. You may need to write a small program which would take the lod scores from these programs as input and write them in the above format in the lodfile after listing the theta (d) values. Input file#3: seed_file (needed only for running SEM) Enter two random long integers (each less than 2 billion), which are taken as seeds by the SEM. Example: 1057170440 1057170449 ************* RUNNING EM and SEM *************** Once the input files are ready, you can run the executable files em_twopoint/em_multipoint and sem_twopoint/sem_multipoint. You will need to specify the name of output file where the output will be stored. Suppose the name of this file is "outputfile". If you already have file with the same name in your current directory, the new output would be appended to it. In addition to this main output file, SEM generates an additional output file called "failfile". To run EM, use either > em_twopoint infile lodfile outputfile or > em_multipoint infile lodfile outputfile depending on what kind of data you have inputted. Similarly, to run SEM, use either > sem_twopoint infile lodfile seed_file outputfile or > sem_multipoint infile lodfile seed_file outputfile IMPORTANT NOTES: (1) If SE of EM estimates is asked for, three intermediate files are produced with the names estimates_file, lod_se_file and zij_file. Make sure you don't have files with these names in the directory you're running this program, otherwise they would be OVERWRITTEN!! (2) As mentioned above, SEM generates an extra output file called "failfile". Also, if SE of SEM estimates is asked for, a file named "infile_se" is produced. Make sure you don't have file with these names in the directory you're running this program, otherwise they would be OVERWRITTEN!! ************ OUTPUT FILES from running EM and SEM *********************** The "outputfile" of the two programs will give the estimates of alpha_i, i = 1, ..., k and theta (d). In addition, for multipoint analysis only, the SEM will also show "number of infinity distances" in the final 100 iterations. The multipoint SEM estimates are obtained by averaging non-infinity distances in the last 100 iterations, so by knowing the number of non-infinity distances, you know the average is calculated over how many numbers. If SEM fails (a total of 2000 restarts is allowed), then no estimates are reported, instead, the outputfile will say "SEM failed" and is followed by a summary of violation of boundary conditions; this may be helpful in identifying groups with all linked or all unlinked families. For each group, i.e., each alpha_i, three numbers are reported. Of these three numbers, the first number tells how many times a boundary condition corresponding to that alpha_i is violated, second number tells how many times the condition "alpha_i not very close to 0" is violated and the third number tells how many times the condition "alpha_i not very close to 1" is violated. So, the sum of the second and third numbers is the first number. If SE of EM is asked for, the outputfile of the EM will also contain the information matrix, which has to be inverted to get the variance-covariance matrix and the SE of the EM estimates. The first diagonal element always corresponds to the linkage paramter (theta or d) and the other diagonal elements are for the alpha_i's, in order in which the groups were written in the lodfile. If SE of SEM is desired, the outputfile will say at the end "You requested for SE of the estimates: Please make lod_se_sem and then run se_sem". The instructions for making these files are given later. The extra output file (with name failfile) of SEM will contain a record of failures of SEM, if any, because of violation of the boundary conditions. This file is not directly of use to you, since a summary of this record is written into the main outputfile if SEM fails. Failfile simply lists the current estimates of all alpha_i's when any boundary condition is violated. When SEM fails, this file is summarized and reported in the main outputfile. If SEM does not fail, the last line in this file says "closing fail file". ******************* SE OF SEM ESTIMATES ******************* Once the SEM estimates are obtained, note the value of theta_hat (estimate of theta/d). Most likely, it will not be one of the values in the grid of theta (d) values used for running SEM. Do one of the two following options (if estimate is one of the grid values, go to the option 2): (1) Choose two points, one less than theta_hat (say, theta_minus) and another greater than theta_hat (say, theta_plus), but both very close to theta_hat, and calculate lod scores of all families at theta_minus, theta_hat, and theta_plus. (2) Round off the value of theta_hat to the nearest theta value in the grid. Then, choose as theta_minus and theta_plus the two nearest neighbours of the theta_hat in the grid. Then you can use the same lod scores at theta_minus, theta_hat, and theta_plus, that you used for SEM. NOTE: This option should be used only if the grid of theta values is very fine so that there is not much loss in precision due to rounding. Now create a new file, say "lod_se_sem". First enter the values theta_minus, theta_hat, and theta_plus, in a row. Then below it, at each row enter the lod scores of each family (families listed in the same order as in the lodfile for running SEM) at theta_minus, theta_hat, and theta_plus (in this order only). So, like the lodfile, this file will also look like a matrix where the rows correspond to the families and the columns correspond to the theta values. There will be a total of (# of families in the sample + 1) rows and 3 columns. You also need a file, say "seed_file_se", in which you need to enter two integers in the same way as you did in seed_file before you ran SEM. Now you are ready to obtain SE of SEM. Just check that you have the file "infile_se" (created during SEM execution) in the current directory To find SE, run the command: se_sem lod_se_sem seed_file_se outputfile The last file is where the information matrix of the SEM estimates would be written. This needs to be inverted to obtain the variance-covariance matrix and the SE of the EM estimates, as in the case of SE of EM estimates. Note: The SE of SEM is calculated based on 2000 bootstrap samples. We expect it to give accurate estimates of SE. It it does not or if you want more precise estimates, you can open the source code se_sem.c in the src/ directory and in the third line where it says "#define NUM_IT 2000", change 2000 to whatever number you prefer and then re-compile the program. ******************** EXAMPLE ******************************* An example of running the programs with a simulated dataset is given in the sub-directory example/. Please look at the file README_example in this sub-directory which gives instructions for executing the example.