This file contains instructions for running the CSI method as described in
Papachristou and Lin (Genet Epidemiol 2006; 30:3-17 and Genet Epidemiol
2006; 30:18-29). It also covers two supplementary programs, MLEs_IBD and
BOOTS. MLEs_IBD computes the maximum likelihood estimates (MLEs) of the IBD
sharing probabilities for two affected sibs, using the likelihood method
described by Risch (Am J Hum Genet 1990; 46:242-253). BOOTS provides
parametric and non-parametric bootstrap intervals for the location of a
putative trait locus based on the same likelihood function.

These programs are written in C. However, CSI assumes that the R software
environment is available. R can be freely downloaded from the following
web page:

http://www.r-project.org/

********* INSTALLATION *************

Unzip and untar CSI.tar.gz:

> gzip -d CSI.tar.gz
> tar xvf CSI.tar

This will produce a directory CSI/ in the current directory. CSI/ will
contain:

1. README    : file (this document)
2. src/      : directory (source code)
3. lib/      : directory (collection of routines)
4. example/  : directory (sample input files to run the algorithms and
               ensure that they are compiled correctly)
5. CSI       : pre-compiled CSI program for Linux
6. MLEs_IBD  : pre-compiled MLEs_IBD program for Linux
7. BOOTS     : pre-compiled BOOTS program for Linux

******** COMPILING ******************

If for any reason the pre-compiled versions do not run, you can recompile
the programs. First, compile the necessary libraries. Change to the
subdirectory CSI/lib and type the following command:

> make

This command will automatically create all the libraries you need.
Next, move to the CSI/src directory and again type:

> make

This command will automatically create new executable files for CSI and
MLEs_IBD and move them to the parent directory (CSI/).

********** Getting Started *********

We first describe the input files for running CSI (see also the files in
the directory "examples").

*********** INPUT FILES for running CSI ****************

CSI needs one input file to run (CSI_info). The information contained in
this file is described below.

LINE 1: name of marker description file
LINE 2: name of pedigree file
LINE 3: name of file where results will be stored
LINE 4: the three IBD sharing probabilities at the trait locus, separated
        by space(s) or tabs
LINE 5: map function used for conversion of recombination fractions
        (0 Haldane, 1 Kosambi)
LINE 6: "P S I O" - method of defining the points where the analysis will
        be performed. If P=1, the analysis is performed at all markers
        and at "S" locations between consecutive markers. If P=0, the
        analysis is performed at all markers and at as many points as
        needed between the markers, in increments of "I". Finally, "O"
        controls how far off the ends of the map the analysis extends.
LINE 7: CSI version (0 for original CSI), # MC replicates for the
        multipoint analysis
LINE 8: alpha = (1-level) of interval to be computed
LINE 9: maximum percentage of markers allowed to be missing per person

** NOTE: additional information at the end of each line is ignored

Example: Say that we are trying to construct an 85% confidence region for
a trait locus at which the IBD sharing distribution is known to be
(z_0,z_1,z_2)=(.1,.4,.5). Let marker.loc be the marker description file
and ped_file.txt be the file with the sample genotypes. Further, assume
that we want to use CSI-v3 to construct the confidence region, and that
we want to use the scores at all markers plus 9 locations between
consecutive markers.
We also want to use 10,000 MC replicates to estimate the necessary
parameters under the null, and want to store the results of the analysis
in the file out.txt. Finally, we want to include, as if they were
completely genotyped, people who are missing at most 10% of their marker
genotypes. The info file should look like:

marker.loc     << name of marker description file
ped_file.txt   << name of pedigree file
out.txt        << File where results will be stored
.1 .4 .5       << IBD sharing probabilities at trait locus (z_0,z_1,z_2)
0              << Map function (0 Haldane, 1 Kosambi)
1 9 .5 0       << Points, Steps, Increment, offend
3 10000        << CSI version (0 for original CSI), MC replicates
.15            << alpha
.10            << percent missing markers allowed

Note that everything after the symbol "<<" is a comment and will be
ignored by the program.

********************** MARKER DESCRIPTION FILE *************************

Marker description file: marker.loc

This file provides description information on the marker data (allele
frequencies for each genetic marker, frequency and penetrance information
for the disease) and has exactly the same format as the linkloci.dat file
used by GENEHUNTER. The format of this file must be identical to the
Linkage parameter file (output from the PREPLINK program). See the file
MS.loc as an example of this file format or consult the Linkage
documentation for further help.

The pre-compiled version of CSI can handle up to 1,000 markers at a time,
with an arbitrary number of alleles, and a total of 5,000 points. For
larger analyses, the user may recompile the program after changing the
limits.

********************** PEDIGREE FILE *************************

Family genotypes file: ped_file.dat

This file is identical to the pedigree file used in GENEHUNTER. The
pedigree should be in the Linkage pedigree input format (before running
MAKEPED or doing any preprocessing!). Each line of this file must have
the following structure:

 3   12    8    9    1    2    1   1 2  8 3  0 0  4 6  1 3 ... 4.10 0.374
(a)  (b)  (c)  (d)  (e)  (f)  (g) (h ------------------------) (i -------)

(a) pedigree name
(b) individual ID #
(c) father's ID #
(d) mother's ID #
(e) sex (1=MALE, 2=FEMALE)
(f) affectation status (1=UNAFFECTED, 2=AFFECTED)
(g) liability class (OPTIONAL) - classes specified in marker data file
(h) marker genotypes
(i) phenotype/covariate data (OPTIONAL)

A 0 in any of the disease phenotype or marker genotype positions (as in
the genotypes for the third marker above) indicates missing data. See the
file linkped.pre as an example.

A - in the phenotype/covariate data indicates missing data.
NB: 0 is a real value that a phenotype may take on and DOES NOT represent
missing phenotype data.

The pre-compiled version of CSI can handle up to 1,000 affected sib pairs
at once. For a larger number of pairs, the user may recompile the program
after changing the limits.

******************** Output Information ********************

The CSI program writes several pieces of information to the output file:

Families:
    100 with 0 Parent(s)
    100 with 1 Parent(s)
    100 with 2 Parent(s)

Method: CSI-V3

Locus    Obs.     mu      sigma    z_score   P-value
12.50    1.3435   1.737   0.0204   -19.23    0.000000
14.38    1.3582   1.695   0.0218   -15.43    0.000000
16.25    1.3750   1.682   0.0222   -13.82    0.000000
18.13    1.3938   1.712   0.0218   -14.61    0.000000
  .        .        .       .         .         .
42.50    1.7461   1.791   0.0200    -2.23    0.026059
44.38    1.6854   1.721   0.0217    -1.64    0.101067
46.25    1.6285   1.704   0.0218    -3.45    0.000570
  .        .        .       .         .         .
87.50    1.1175   1.737   0.0203   -30.47    0.000000

95% Confidence region(s):
    36.791  42.273
    43.350  44.707

95% Confidence region(s) - Smoothed:
    36.791  42.273
    43.350  44.707

First, the program reports the number of nuclear families used in the
analysis. This number reflects the total number of ASPs created after
larger sibships were broken down.

Locus   : location where the score was computed
Obs.    : observed mean IBD sharing by the ASPs in the sample
mu      : expected IBD sharing at the locus under the null hypothesis
sigma   : SD of IBD sharing at the locus under the null hypothesis
z_score : standardized score
P-value : P-value of the standardized score, assuming normality

Finally, the program provides two confidence regions (CRs) for the
location of the trait locus. The first is based on a linear approximation
of the missing scores between consecutive loci in the analysis, while the
second is based on a fitted smoothing spline.

The program also creates a file "smoothed_scores.eps" in which it graphs
the standardized scores for the locations in the analysis (dashed line)
as well as the smoothing spline fitted through these scores (solid line).
The two horizontal lines mark the two cutoff points +-z_a.

******************** Running CSI *******************************

To run CSI, after compiling, simply type the following command:

> CSI CSI_info

***********************************************************************
******************** MAXIMUM LIKELIHOOD *******************************
***********************************************************************

The MLEs_IBD program also takes one input file: mle_info. This file
contains the following information:

LINE 1: name of marker description file
LINE 2: name of pedigree file
LINE 3: name of file where results will be stored
LINE 4: lower and upper locations where MLEs should be computed, as well
        as the increment for all the positions in between them
LINE 5: map function used for conversion of recombination fractions

** NOTE: additional information at the end of each line is ignored

EXAMPLE: Compute the MLEs for the IBD probabilities starting from
location 50, ending at location 60, in increments of 2.5 cM. The marker
descriptions are in file MS.loc, and the genotypes are stored in file
MS.ped.
The mle_info file for the above analysis should look like:

MS.loc       << Marker Description File
MS.ped       << Pedigree genotypes
MLE.out      << File where results will be stored
50 60 2.5    << lower limit, upper limit, increment
0            << Map conversion (0=Haldane, 1=Kosambi)

******************** Output Information ********************

This is a sample output of the MLEs_IBD program:

Sample Size= 300

Loc(cM)   LOD     z0         z1         z2
50.00     43.84   0.023203   0.363983   0.612813 ****
52.50     39.46   0.019953   0.365610   0.614437
55.00     31.62   0.029037   0.399878   0.571085
57.50     21.74   0.065955   0.450420   0.483626
60.00     21.06   0.059320   0.443626   0.497054

Note: The four stars mark the location where the maximum occurred.

***********************************************************************
**************** Bootstrap Confidence Intervals **********************
***********************************************************************

The BOOTS program also takes one input file: BOOT_info. This file
contains the following information:

LINE 1:  name of marker description file
LINE 2:  name of pedigree file
LINE 3:  name of file where results will be stored
LINE 4:  version of the bootstrap: 0-non-parametric (NPB),
         1-parametric (PMB)
LINE 5:  number of bootstrap samples B used for constructing the
         empirical distribution
LINE 6:  alpha = 1 - coverage probability of region
LINE 7:  map function for conversion of recombination fractions
         (0-Haldane, 1-Kosambi)
LINE 8:  # of points b/w markers at which MLEs will be computed, offend
         distance
LINE 9:  PMB only: MLEs computed from current data (0) or fixed by
         researcher (1)
LINE 10: if the previous line is 1, location tau of trait locus
LINE 11: trait IBD distribution at the above location (only for PMB)

** NOTE: additional information at the end of each line is ignored
** NOTE: Lines 10-11 are ignored if line 9 is set to 0 or line 4 is 0.
         In addition, line 9 is ignored if line 4 is 0.

EXAMPLE 1: Compute a 90% CI for the trait locus using the NPB. For the
maximization, use 10 points between markers.
The map description is in file MS.loc, and the genotypes are stored in
file MS.ped. Use 400 bootstrap replications to approximate the empirical
distribution. The BOOT_info file for the above analysis should look like:

MS.loc      << Marker Description File
MS.ped      << Pedigree genotypes
BOOT.out    << File where results will be stored
0           << 0-non parametric(NPB)
400         << Number of bootstrap samples B
.10         << coverage = 1-.1=.9
0           << Map function Haldane (0)
10 0        << # of points b/w markers=10 , offend distance=0

EXAMPLE 2: Compute a 95% CI for the trait locus using the PMB. The
location of the trait locus (55.7cM) will be provided by the researcher,
and the IBD distribution will be set to (z_0,z_1,z_2)=(.1,.5,.4). For the
maximization, use 7 points between markers. The map description is in
file MS.loc, and the genotypes are stored in file MS.ped. Use 200
bootstrap replications to approximate the empirical distribution. The
BOOT_info file for the above analysis should look like:

MS.loc      << Marker Description File
MS.ped      << Pedigree genotypes
BOOT.out    << File where results will be stored
1           << 1-Parametric(PMB)
200         << Number of bootstrap samples B
.05         << coverage = 1-.05=.95
0           << Map function Haldane (0)
7 2         << # of points b/w markers=7 , offend distance=2cM
1           << 1-MLEs provided by researcher
55.7        << trait locus tau
.1 .5 .4    << trait IBD distribution at above location

******************** Output Information ********************

This is a sample output of the BOOTS program:

Maximum Likelihood Estimates
Tau      z0        z1        z2
72.93    0.00546   0.10538   0.88916

<<<<<<>>>>>>

(This model was used for the PMB)
Null Parameters (Fixed):
Tau      z0        z1        z2
55.70    0.10000   0.50000   0.40000

Sample Size (ASPs)= 250
Map Function: Haldane

95.00% Confidence Inervals (Based on 200 MC replicates)
Param   Mean      SD         LB        UB
tau     55.7000   -10.0000   51.6293   60.0350
p1       0.5000   -10.0000    0.4342    0.5000
p2       0.4000   -10.0000    0.3724    0.4655

MLEs for a collection of point across chromosome:
Loc      -2log(R)   z0         z1         z2
12.50    8.89284    0.219634   0.439268   0.341099
12.88    9.31316    0.218439   0.436878   0.344683
13.25    9.72768    0.217295   0.434591   0.348114
  .         .          .          .          .
  .         .          .          .          .
  .         .          .          .          .

***********************************************************************
******************** Example Files *******************************
***********************************************************************

The subdirectory CSI/examples contains the following files:

mle_info   : sample input file for the MLEs_IBD program
BOOT_info  : sample input file for the BOOTS program
MS.ped     : contains microsatellite genotypes from 1000 nuclear families
             on 16 markers
MS.loc     : marker description file for the 16 markers
CSI_info   : sample input file for the CSI program
SNP.ped    : contains SNP genotypes from 1000 nuclear families on 41 SNPs
SNP.loc    : marker description file for the 40 SNPs
seedmorgan : file containing a random seed for the random number
             generator used by the CSI program. This file needs to be in
             the directory from which the CSI/BOOTS program is called.

Note: the microsatellite markers and the SNPs come from the same
individuals, so they can be used for the two-step CSI procedure described
by Papachristou and Lin (in press).
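
Note on map functions: the CSI_info, mle_info and BOOT_info files all
select a map function with the codes 0 (Haldane) and 1 (Kosambi). As a
rough illustration only, the Python sketch below shows the textbook forms
of these two map functions, converting a map distance in cM into a
recombination fraction. It is not taken from the CSI source code. The
function name recombination_fraction and its arguments are hypothetical;
only the 0/1 coding follows this README.

```python
import math


def recombination_fraction(distance_cm, map_function=0):
    """Convert a map distance (cM) to a recombination fraction.

    map_function follows the README coding: 0 = Haldane, 1 = Kosambi.
    This helper is an illustrative assumption, not part of CSI.
    """
    if distance_cm < 0:
        raise ValueError("map distance must be non-negative")
    d = distance_cm / 100.0  # centimorgans -> Morgans
    if map_function == 0:
        # Haldane: assumes no crossover interference
        return 0.5 * (1.0 - math.exp(-2.0 * d))
    if map_function == 1:
        # Kosambi: allows for some crossover interference
        return 0.5 * math.tanh(2.0 * d)
    raise ValueError("map_function must be 0 (Haldane) or 1 (Kosambi)")


if __name__ == "__main__":
    # Compare the two map functions at a few distances
    for d_cm in (1.0, 10.0, 50.0):
        h = recombination_fraction(d_cm, 0)
        k = recombination_fraction(d_cm, 1)
        print(f"{d_cm:6.1f} cM  Haldane={h:.4f}  Kosambi={k:.4f}")
```

Both functions agree closely for short distances and never exceed 0.5, so
the choice between them matters mostly for widely spaced markers.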