This file contains instructions for running the CSI method as
described in Papachristou and Lin (Genet Epidemiol 2006; 30:3-17 and
Genet Epidemiol 2006; 30:18-29). It also covers two supplementary
programs (MLEs_IBD and BOOTS). MLEs_IBD computes the MLEs of the IBD
sharing probabilities of two affected sibs, using the likelihood
method described by Risch (Am J Hum Genet 1990; 46:242-253). BOOTS
provides parametric and non-parametric bootstrap intervals for the
location of a putative trait locus based on the above likelihood
function.

These programs are written in C. However, CSI assumes that the
software environment R is available. This software package can be
freely downloaded from the following web page: 

http://www.r-project.org/

********* INSTALLATION *************

Unzip and untar CSI.tar.gz:

> gzip -dc CSI.tar.gz | tar xvf -


This will produce a directory CSI/ in the current directory. CSI/ will
contain: 
1. README       : file (this document),
2. src/         : directory (source code),
3. lib/         : directory (library routines),
4. examples/    : directory (sample input files to run the
                  programs and check that they were compiled correctly),
5. CSI          : pre-compiled CSI program for Linux,
6. MLEs_IBD     : pre-compiled MLEs_IBD program for Linux,
7. BOOTS        : pre-compiled BOOTS program for Linux.


******** COMPILING ******************

If for any reason the pre-compiled versions do not run, you can
recompile the programs. First you need to compile the necessary
libraries. Descend to subdirectory CSI/lib and type the following
command

> make 

The above command will automatically create all libraries you need.

Next, move to the CSI/src directory and again type 

> make 

This command will automatically create new executable files for
CSI and MLEs_IBD and move them to the parent directory (CSI/).

********** Getting Started *********

We first describe the input files for running CSI
(see also the files in the directory "examples").

*********** INPUT FILES for running CSI ****************

CSI needs one input file to run (CSI_info). The following is a
description of the information contained in this input file.

LINE 1: name of marker description file
LINE 2: name of pedigree file
LINE 3: name of file where results will be stored
LINE 4: the three IBD sharing probabilities at trait locus separated
        by space(s) or tabs
LINE 5: Map function used for conversion of recombination fractions
        (0 Haldane, 1 Kosambi)
LINE 6:  "P S I O" - method of defining the points where the
         analysis will be performed. If P=1, then the analysis will be
         performed on all markers and at "S" locations in between
         consecutive markers. If P=0, then the analysis will be
         performed on all markers and at as many points as needed
         in between the markers, in increments of "I". Finally, "O"
         controls how far off the ends of the map the analysis
         should extend
LINE 7:  CSI version (0 for original CSI), # MC replicates for
	 the multipoint analysis
LINE 8:  alpha = (1-level) of interval to be computed
LINE 9:  Maximum percentage of Markers allowed to be missing per person

** NOTE: additional information at the end of each line is ignored 
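
For reference, the two map functions named on LINE 5 have standard
closed forms. The following is a minimal Python sketch, not part of the
CSI distribution, that converts a map distance in cM to a recombination
fraction. The function names are illustrative only, and the code makes
no claim about how CSI implements the conversion internally.

```python
import math

def haldane(d_cm):
    """Recombination fraction for a map distance d_cm (cM), Haldane map."""
    d = d_cm / 100.0  # cM -> Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def kosambi(d_cm):
    """Recombination fraction for a map distance d_cm (cM), Kosambi map."""
    d = d_cm / 100.0  # cM -> Morgans
    return 0.5 * math.tanh(2.0 * d)

def recomb_fraction(d_cm, map_function=0):
    """Same coding as LINE 5 of the info file: 0 Haldane, 1 Kosambi."""
    if map_function == 0:
        return haldane(d_cm)
    if map_function == 1:
        return kosambi(d_cm)
    raise ValueError("map_function must be 0 (Haldane) or 1 (Kosambi)")
```

Both functions return 0 at distance 0 and approach 0.5 as the distance
grows; for the same distance the Kosambi value is the larger of the two.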
 
Example: Say that we are trying to construct an 85% confidence region
for a trait locus, and that we know the IBD sharing distribution at
the trait locus is (z_0,z_1,z_2)=(.1,.4,.5). Let marker.loc be the
marker description file and ped_file.txt be the file with the sample
genotypes. Further, assume that we want to use CSI-v3 to construct
the confidence region, and we want to use the scores at all markers
plus 9 locations between consecutive markers. We also want to use
10,000 MC replicates to estimate the necessary parameters under the
null, and want to store the results of the analysis in the file
out.txt. Finally, we want to treat people who are missing at most 10%
of their marker genotypes as if they were completely genotyped. The
info file should look like this:

marker.loc << name of marker description file
ped_file.txt << name of pedigree file
out.txt << File where results will be stored
.1 .4 .5 << IBD sharing probabilities at trait locus (z_0,z_1,z_2)
0 << Map function (0 Haldane, 1 Kosambi)
1 9 .5 0 << Points, Steps, Increment, offend
3 10000 << CSI version (0 for original CSI), MC replicates 
.15 << alpha
.10 << percent missing markers allowed

Note that everything after the symbol "<<" is just a comment and
will be ignored by the program.
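
When many analyses are run, the info file can be generated rather than
typed by hand. The following Python sketch (not part of CSI; the
function and argument names are hypothetical) builds the nine lines in
the order listed above. It assumes that CSI accepts numbers written
with a leading zero (0.1 rather than .1), which has not been verified
against the program.

```python
def csi_info_lines(marker_file, ped_file, out_file, ibd, map_function,
                   points, steps, increment, offend, version, mc_reps,
                   alpha, max_missing):
    """Return the nine lines of a CSI_info file, in the documented order."""
    z0, z1, z2 = ibd
    if abs(z0 + z1 + z2 - 1.0) > 1e-8:
        raise ValueError("IBD sharing probabilities must sum to 1")
    if map_function not in (0, 1):
        raise ValueError("map_function must be 0 (Haldane) or 1 (Kosambi)")
    return [
        marker_file,                                     # LINE 1
        ped_file,                                        # LINE 2
        out_file,                                        # LINE 3
        f"{z0:g} {z1:g} {z2:g}",                         # LINE 4
        str(map_function),                               # LINE 5
        f"{points} {steps} {increment:g} {offend:g}",    # LINE 6
        f"{version} {mc_reps}",                          # LINE 7
        f"{alpha:g}",                                    # LINE 8
        f"{max_missing:g}",                              # LINE 9
    ]

def write_csi_info(path, lines):
    """Write the lines to path, one per line."""
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")

# The example above: 85% region, CSI-v3, 9 points between markers.
example = csi_info_lines("marker.loc", "ped_file.txt", "out.txt",
                         (0.1, 0.4, 0.5), 0, 1, 9, 0.5, 0, 3, 10000,
                         0.15, 0.10)
```

Calling write_csi_info("CSI_info", example) then produces a file
equivalent to the one shown above, without the "<<" comments.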


********************** MARKER DESCRIPTION FILE *************************


Marker description file: marker.loc

This file provides description information on the marker data (allele
frequencies for each genetic marker, frequency and penetrance
information for the disease) and has exactly the same format as the
linkloci.dat file used by GENEHUNTER. The format of this file must be
identical to the Linkage parameter file (output from the PREPLINK
program). See the file MS.loc as an example of this file format or
consult the Linkage documentation for further help. The pre-compiled
version of CSI can handle up to 1,000 markers at a time, with an
arbitrary number of alleles, and a total of 5,000 points. For larger
data sets, the user may change these limits and recompile the program.


**********************PEDIGREE FILE*************************

Family genotypes file: ped_file.txt

This file is identical to the pedigree file used in GENEHUNTER. The
pedigree should be in the Linkage pedigree input format (before
running MAKEPED or doing any preprocessing!). Each line of this file
must have the following structure:

     3    12   8   9   1   2   1      1 2   8 3   0 0   4 6   1 3 ...  4.10 0.374
    (a)   (b) (c) (d) (e) (f) (g)     (h ------------------------)     (i -------)

    (a)  pedigree name
    (b)  individual ID #
    (c)  father's ID #
    (d)  mother's ID #
    (e)  sex (1=MALE, 2=FEMALE)
    (f)  affectation status (1=UNAFFECTED, 2=AFFECTED)
    (g)  liability class (OPTIONAL) - classes specified in marker data file
    (h)  marker genotypes
    (i)  phenotype/covariate data (OPTIONAL)

    A 0 in any of the disease phenotype or marker genotype positions
    (as in the genotypes for the third marker above) indicates
    missing data. See the file linkped.pre as an example.

    A - in the phenotype/covariate data indicates missing data - NB:
    0 is a real value that a phenotype may take on and DOES NOT represent
    missing phenotype data
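
To make the column layout concrete, here is a small Python sketch (not
part of CSI) that splits one pedigree line into the fields (a)-(i) and
computes the fraction of markers with a missing genotype. The number of
markers and the presence of the optional liability class cannot be
read from the line itself, so both are passed in as arguments. Counting
a marker as missing when either allele is 0 is an assumption made here
for illustration; it is not necessarily the rule CSI applies.

```python
def parse_ped_line(line, n_markers, has_liability=False):
    """Split a pre-MAKEPED Linkage pedigree line into named fields."""
    tok = line.split()
    record = {
        "pedigree": tok[0],      # (a)
        "id": tok[1],            # (b)
        "father": tok[2],        # (c)
        "mother": tok[3],        # (d)
        "sex": int(tok[4]),      # (e) 1=male, 2=female
        "status": int(tok[5]),   # (f) 1=unaffected, 2=affected, 0=missing
        "liability": None,       # (g) optional
    }
    pos = 6
    if has_liability:
        record["liability"] = tok[pos]
        pos += 1
    alleles = tok[pos:pos + 2 * n_markers]
    if len(alleles) != 2 * n_markers:
        raise ValueError("line has fewer alleles than 2 * n_markers")
    # (h) one (allele1, allele2) pair per marker, kept as strings
    record["genotypes"] = [(alleles[i], alleles[i + 1])
                           for i in range(0, len(alleles), 2)]
    # (i) anything left over is phenotype/covariate data; "-" is missing
    record["covariates"] = tok[pos + 2 * n_markers:]
    return record

def fraction_missing(genotypes):
    """Fraction of markers with at least one allele coded 0 (assumption)."""
    missing = sum(1 for a, b in genotypes if a == "0" or b == "0")
    return missing / len(genotypes)

# The sample line above, truncated to the five markers that are shown.
sample = "3 12 8 9 1 2 1  1 2  8 3  0 0  4 6  1 3  4.10 0.374"
rec = parse_ped_line(sample, n_markers=5, has_liability=True)
```

For this truncated line, one of the five markers is missing, so
fraction_missing(rec["genotypes"]) is 0.2.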


The pre-compiled version of CSI can handle up to 1,000 affected
sib pairs at once. For a larger number of pairs, the user may change
this limit and recompile the program.

******************** Output Information ********************

The CSI program writes several pieces of information to the
output file:

Families:
 100 with 0 Parent(s)
 100 with 1 Parent(s)
 100 with 2 Parent(s)

Method: CSI-V3
 Locus    Obs.    mu     sigma  z_score P-value
  12.50  1.3435   1.737  0.0204  -19.23 0.000000
  14.38  1.3582   1.695  0.0218  -15.43 0.000000
  16.25  1.3750   1.682  0.0222  -13.82 0.000000
  18.13  1.3938   1.712  0.0218  -14.61 0.000000
    .	    .	    .       .       .       .
  42.50  1.7461   1.791  0.0200   -2.23 0.026059
  44.38  1.6854   1.721  0.0217   -1.64 0.101067
  46.25  1.6285   1.704  0.0218   -3.45 0.000570
    .	    .	    .       .       .       .
  87.50  1.1175   1.737  0.0203  -30.47 0.000000


95% Confidence region(s):
 36.791  42.273
 43.350  44.707

95% Confidence region(s) - Smoothed:
 36.791  42.273
 43.350  44.707
 

First, it outputs the number of nuclear families used in the analysis.
This number reflects the total number of ASPs created after larger
sibships were broken down.

Locus  : Location where the score is computed
Obs.   : Observed mean IBD sharing by the ASPs in the sample
mu     : Expected IBD sharing at the locus under the null hypothesis
sigma  : SD of IBD sharing at the locus under the null hypothesis
z_score: Standardized score
P-value: P-value of the standardized score, assuming normality
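
The relation between these columns can be sketched in a few lines of
Python (not part of CSI). The sketch assumes that the standardized
score is (Obs. - mu)/sigma and that the P-value is the two-sided
normal tail probability. Both assumptions agree with the sample table
above up to rounding (for example, z = -1.64 gives a two-sided P-value
of about 0.101), but they are inferred from the output, not taken from
the source code.

```python
import math

def z_and_p(obs, mu, sigma):
    """Standardized score and two-sided normal P-value (assumed forms)."""
    z = (obs - mu) / sigma
    # P(|Z| >= |z|) for a standard normal Z
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Row at locus 42.50 of the sample output, using the printed values.
z, p = z_and_p(1.7461, 1.791, 0.0200)
```

Because the printed mu and sigma are rounded, the recomputed score
(about -2.245) differs slightly from the printed -2.23.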


Finally, the program provides two confidence regions (CRs) for the
location of the trait locus. The first is based on linear
interpolation of the scores between consecutive loci in the analysis,
while the second is based on a fitted smoothing spline.
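
The first (linear) region can be illustrated with the following Python
sketch, which is not part of CSI. It assumes that the region consists
of all locations where the standardized score lies within the two
normal cutoffs, and that the end points are found by interpolating
linearly between neighbouring loci. The function names are
hypothetical, and the case where two neighbouring scores lie outside
the cutoffs on opposite sides is ignored for brevity.

```python
from statistics import NormalDist

def _crossing(x0, z0, x1, z1, target):
    """Location where the line through (x0,z0),(x1,z1) equals target."""
    return x0 + (target - z0) / (z1 - z0) * (x1 - x0)

def confidence_regions(loci, z_scores, alpha):
    """Intervals where |z| <= cutoff, with linearly interpolated ends."""
    cutoff = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    regions, start = [], None
    for i, (x, z) in enumerate(zip(loci, z_scores)):
        inside = abs(z) <= cutoff
        if i == 0:
            if inside:
                start = x  # region starts at the first locus
            continue
        x_prev, z_prev = loci[i - 1], z_scores[i - 1]
        was_inside = abs(z_prev) <= cutoff
        if inside and not was_inside:
            # entering the region: cross the cutoff on the side we came from
            target = cutoff if z_prev > 0 else -cutoff
            start = _crossing(x_prev, z_prev, x, z, target)
        elif was_inside and not inside:
            # leaving the region: cross the cutoff on the side we go to
            target = cutoff if z > 0 else -cutoff
            regions.append((start, _crossing(x_prev, z_prev, x, z, target)))
            start = None
    if start is not None:
        regions.append((start, loci[-1]))  # region runs to the last locus
    return regions

# Three consecutive rows of the sample output above (alpha = 0.05).
cr = confidence_regions([42.50, 44.38, 46.25], [-2.23, -1.64, -3.45], 0.05)
```

With the rounded scores printed above, this gives a single region of
roughly 43.36 to 44.71, close to the second region (43.350, 44.707)
reported in the sample output.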


The program also creates a file "smoothed_scores.eps" in which it
graphs the standardized scores for the locations in the analysis
(dashed line) as well as the smoothing spline fitted through these
scores (solid line). The two horizontal lines mark the two cutoff
points +/- z_a.


******************** Running CSI *******************************

To run CSI, after compiling, simply type the following command

> CSI CSI_info


***********************************************************************
******************** MAXIMUM LIKELIHOOD *******************************
***********************************************************************
The MLEs_IBD program also takes one input file: mle_info. This file
contains the following information:


LINE 1: name of marker description file
LINE 2: name of pedigree file
LINE 3: name of file where results will be stored
LINE 4: Lower, upper locations where MLEs should be computed, as well
	as the increment for all the positions in between them	        
LINE 5: Map function used for conversion of recombination fractions
        (0 Haldane, 1 Kosambi)

** NOTE: additional information at the end of each line is ignored 


EXAMPLE: Compute the MLEs of the IBD probabilities starting at
location 50 and ending at location 60, in increments of 2.5 cM. The
marker descriptions are in file MS.loc, and the genotypes are stored
in file MS.ped.

The mle_info file for the above analysis should look like

MS.loc << Marker Description File
MS.ped << Pedigree genotypes 
MLE.out << File where results will be stored
50 60 2.5 << lower limit, upper limit, increment
0 << Map conversion (0=Haldane, 1=Kosambi)
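
The positions implied by LINE 4 can be listed with a short Python
sketch (not part of CSI; the function name is hypothetical). It
assumes that both limits are included and that, when the range is not
an exact multiple of the increment, the last position is the largest
one not exceeding the upper limit.

```python
import math

def mle_grid(lower, upper, increment):
    """Positions lower, lower+increment, ... up to upper (assumed rule)."""
    if increment <= 0 or upper < lower:
        raise ValueError("need increment > 0 and upper >= lower")
    # small tolerance so that exact multiples are not lost to rounding
    n = int(math.floor((upper - lower) / increment + 1e-9))
    return [lower + k * increment for k in range(n + 1)]

grid = mle_grid(50.0, 60.0, 2.5)
```

For the example above, the grid is 50, 52.5, 55, 57.5 and 60 cM.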


******************** Output Information ********************
This is a sample output of the MLEs_IBD program:

Sample Size= 300

 Loc(cM)            LOD            z0              z1              z2   
   50.00           43.84        0.023203        0.363983        0.612813        ****
   52.50           39.46        0.019953        0.365610        0.614437
   55.00           31.62        0.029037        0.399878        0.571085
   57.50           21.74        0.065955        0.450420        0.483626
   60.00           21.06        0.059320        0.443626        0.497054

Note: The four stars mark the location where the maximum occurred 
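
If the results need further processing, the table can be read back
with a few lines of Python (not part of CSI). The sketch below assumes
the column layout of the sample output above and simply skips any line
that does not start with five numbers.

```python
def parse_mle_output(text):
    """Return one dict per data row of an MLEs_IBD output table."""
    rows = []
    for line in text.splitlines():
        tok = line.split()
        if len(tok) < 5:
            continue
        try:
            loc, lod, z0, z1, z2 = (float(t) for t in tok[:5])
        except ValueError:
            continue  # header or other non-numeric line
        rows.append({"loc": loc, "lod": lod, "ibd": (z0, z1, z2),
                     "starred": "****" in tok[5:]})
    return rows

sample_output = """Sample Size= 300

 Loc(cM)            LOD            z0              z1              z2
   50.00           43.84        0.023203        0.363983        0.612813        ****
   52.50           39.46        0.019953        0.365610        0.614437
   55.00           31.62        0.029037        0.399878        0.571085
"""
rows = parse_mle_output(sample_output)
best = max(rows, key=lambda r: r["lod"])
```

Here best is the row at 50.00 cM, the same row that the program marks
with four stars.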

***********************************************************************
**************** Bootstrap Confidence Intervals  **********************
***********************************************************************
The BOOTS program also takes one input file: BOOT_info. This file
contains the following information:


LINE  1: name of marker description file
LINE  2: name of pedigree file
LINE  3: name of file where results will be stored
LINE  4: version of the bootstrap: 0-non parametric(NPB), 1-Parametric(PMB)
LINE  5: Number of bootstrap samples B used for constructing empirical dist.
LINE  6: alpha=1-Coverage probability of region
LINE  7: Map conversion recombination function  (0-Haldane,1-Kosambi) 
LINE  8: # of points b/w markers that MLEs will be computed, offend distance
LINE  9: PMB only: MLEs computed from current data(0) or fixed by researcher(1)
LINE 10: If previous line 1, location tau of trait locus
LINE 11: trait IBD distribution at above location(only for PMB).

** NOTE: additional information at the end of each line is ignored 
** NOTE: Lines 10-11 are ignored if line 9 is set to 0 or line 4 is 0.
	 In addition, line 9 is ignored if line 4 is 0.
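
The dependencies between the lines can be summarized in a Python
sketch (not part of BOOTS; the function and argument names are
hypothetical). It writes lines 9-11 only when they are used, which
matches the two examples below. Whether BOOTS also accepts a file in
which ignored lines are omitted in other situations (for instance PMB
with MLEs computed from the data) has not been verified.

```python
def boot_info_lines(marker_file, ped_file, out_file, parametric, n_boot,
                    alpha, map_function, n_points, offend, fixed=None):
    """Build BOOT_info lines; fixed is None or (tau, (z0, z1, z2))."""
    lines = [
        marker_file,                      # LINE 1
        ped_file,                         # LINE 2
        out_file,                         # LINE 3
        "1" if parametric else "0",       # LINE 4: 0 NPB, 1 PMB
        str(n_boot),                      # LINE 5
        f"{alpha:g}",                     # LINE 6
        str(map_function),                # LINE 7: 0 Haldane, 1 Kosambi
        f"{n_points} {offend:g}",         # LINE 8
    ]
    if parametric:
        # LINE 9 is only meaningful for the parametric bootstrap
        lines.append("0" if fixed is None else "1")
        if fixed is not None:
            tau, ibd = fixed
            if abs(sum(ibd) - 1.0) > 1e-8:
                raise ValueError("IBD probabilities must sum to 1")
            lines.append(f"{tau:g}")                        # LINE 10
            lines.append(" ".join(f"{z:g}" for z in ibd))   # LINE 11
    return lines

npb = boot_info_lines("MS.loc", "MS.ped", "BOOT.out", False, 400,
                      0.10, 0, 10, 0)
pmb = boot_info_lines("MS.loc", "MS.ped", "BOOT.out", True, 200,
                      0.05, 0, 7, 2, fixed=(55.7, (0.1, 0.5, 0.4)))
```

The two calls correspond to EXAMPLE 1 and EXAMPLE 2 below and produce
8 and 11 lines, respectively.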

EXAMPLE 1: Compute a 90% CI for the trait locus using the NPB. For the
maximization use 10 points in between markers. The map description is
in file MS.loc, and the genotypes are stored in file MS.ped. Use 400
bootstrap replications to approximate the empirical distribution.

The BOOT_info file for the above analysis should look like

MS.loc		<< Marker Description File
MS.ped		<< Pedigree genotypes 
BOOT.out	<< File where results will be stored
0		<< 0-non parametric(NPB)
400		<< Number of bootstrap samples B 
.10		<< coverage = 1-.1=.9   
0		<< Map function Haldane (0)
10 0		<< # of points b/w markers=10 , offend distance=0    


EXAMPLE 2: Compute a 95% CI for the trait locus using the PMB. The
location of the trait locus (55.7cM) and the IBD distribution,
(z_0,z_1,z_2)=(.1,.5,.4), will be provided by the researcher. For the
maximization use 7 points in between markers. The map description is
in file MS.loc, and the genotypes are stored in file MS.ped. Use 200
bootstrap replications to approximate the empirical distribution.

The BOOT_info file for the above analysis should look like

MS.loc	      << Marker Description File
MS.ped	      << Pedigree genotypes 
BOOT.out      << File where results will be stored
1	      << 1-Parametric(PMB)
200	      << Number of bootstrap samples B 
.05	      << coverage = 1-.05=.95   
0	      << Map function Haldane (0)
7 2	      << # of points b/w markers=7 , offend distance=2cM    
1	      << 1-MLEs provided by researcher
55.7	      << trait locus tau
.1 .5 .4      << trait IBD distribution at above location 


******************** Output Information ********************
This is a sample output of the BOOTS program:

Maximum Likelihood Estimates

Tau        z0      z1      z2
 72.93  0.00546 0.10538 0.88916

<<<<<<<Parametric Bootstrap>>>>>>> (This model was used for the PMB)
Null Parameters (Fixed):
Tau        z0      z1      z2
 55.70  0.10000 0.50000 0.40000


Sample Size (ASPs)= 250
Map Function: Haldane


95.00% Confidence Inervals (Based on 200 MC replicates)
Param      Mean             SD              LB              UB
 tau     55.7000        -10.0000         51.6293         60.0350
  p1      0.5000        -10.0000          0.4342          0.5000
  p2      0.4000        -10.0000          0.3724          0.4655

MLEs for a collection of point across chromosome:

  Loc      -2log(R)         z0              z1               z2
 12.50       8.89284     0.219634        0.439268        0.341099
 12.88       9.31316     0.218439        0.436878        0.344683
 13.25       9.72768     0.217295        0.434591        0.348114
 .	     .		 .		 .		 .
 .	     .		 .		 .		 .
 .	     .		 .		 .		 .

***********************************************************************
******************** Example Files *******************************
***********************************************************************
The subdirectory CSI/examples contains the following files: 

mle_info   : sample input file for the MLEs_IBD program
BOOT_info  : sample input file for the BOOTS program
MS.ped     : contains microsatellite genotypes from 1000 nuclear
	     families on 16 markers  
MS.loc     : Marker description file for the 16 markers

CSI_info   : sample input file for the CSI program
SNP.ped    : contains SNP genotypes from 1000 nuclear families on 41 SNPs
SNP.loc    : Marker description file for the 40 SNPs


seedmorgan : file containing a random seed for the random number
	     generator used by the CSI program. This file needs to be
	     in the directory where we call the CSI/BOOTS program.


Note: the microsatellite markers and the SNPs come from the same
individuals; thus they can be used for the two-step CSI procedure
described by Papachristou and Lin (in press).