The purpose of the program is to identify differentially expressed tags (genes) under different conditions from SAGE library data. This document contains instructions and examples for running reversible jump MCMC algorithm for SAGE library data analysis in Matlab 7.0 or above. The program can be used to the similar situations as SAGE library data, i.e. counts or natural number data from Binomial or Poisson distribution. Please refer to http://www.stat.ohio-state.edu/~statgen/DE-SAGE.html for reference paper and Matlab codes.

>> help('simuK2') Simulate K=2 SAGE data: [X, n0] = simuK2(pg, n0, N, seed) simulate SAGE data set from intensity pg generated from real data proportions or your own generated proportions. All input parameters have default values: pg=pg2, a two column matrix for which top 50 BW values for 2 group are different proportion. Others are same. n0=[3,3] means each group has 3 libraries. N = SX2*10, a vector with the length length(N) = sum(n0), gives the total counts for each library. seed=234 for random seed or state. SX2 is the vector containning the summation of each library counts for real data sagedata.txt. pg2 and SX2 will be downloaded to the same directory with the programs. -------------------------------------------------------------------------- Output arguments: X: simulated matrix with the same rows as pg and the same column as N, in addition, sum(X, 1) = N. n0: = input n0. The help message for K=3 groups is as follows: >> help('simuK3') Simulate K=3 SAGE data: [X, n0] = simuK3(pg, n0, N, seed) simulate SAGE data set from intensity pg generated from real data proportions or your own generated proportions. All input parameters have default values: pg=pg3, a three column matrix for which 100 values for 3 group are different proportion. Others are same. n0=[3,3,3] means each group has 3 libraries. N = [SX2*10, 30000, 30000, 30000], a vector with the length length(N) = sum(n0), gives the total counts for each library. seed=234 for random seed or state. SX2 is the vector containning the summation of each library counts for real data sagedata.txt. pg3 and SX2 will be downloaded to the same directory with the programs. -------------------------------------------------------------------------- Output arguments: X: simulated matrix with the same rows as pg and the same column as N, in addition, sum(X, 1)=N. n0: = input n0.

>> help('rjmcmcot') RJMCMC --Reversible Jump Markov Chain Monte Carlo used for identifying expressed genes for SAGE library data analysis. The data is in X with Gxn matrix where G is the number of total gene and n=sum(n0) is the number of individuals. The data has K>=2 groups by individuals where K=length(n0). n0 contains the library numbers for each group. This version considers move ``one'' gene after updating parameters. We assume p_{kg}~\beta(t\alpha_{kg}, t(\bar{N_k}-\alpha_{kg})), k=1,...,K. [z, Z] = rjmcmcot(X, n0, I, priorr, da, t, seed) calculate the posterior probability for each tag falling in the DE (differential expressed) set. All input arguments have default values. But if you give the first argument X, you have to input arguments n0 and I. X:(sagedata), Gxn matrix for SAGE library data. n0:([3,3]), 1xK vector gives the number of libraries for each froup. I:(80000), Total iteration numbers. priorr:(0.1), prior probability ratio for a tag in S1. da:(2), the parameter for updating alpha. t:(1), prior sensitivity parameter. seed:(1234), random seed or state. -------------------------------------------------------------------------- Output arguments: z: Gx3 matrix showing: column 1, tag index column 2, posterior probability column 3, Bayes Factor. Z: Ix5 matrix showing: column 1, iteration number column 2, the tag g being picked column 3, the group from which g is picked column 4, the group g is going to column 5, adding(1) or deleting(-1) or no moving(0). >> help('rjmcmcat') RJMCMC --Reversible Jump Markov Chain Monte Carlo used for identifying expressed genes for SAGE library data analysis. The data is in X with Gxn matrix where G is the number of total gene and n=sum(n0) is the number of individuals. The data has K>=2 groups by individuals where K=length(n0). n0 contains the library numbers for each group. This version considers move ALL tages after updating parameters. We assume p_{kg}~\beta(\alpha_{kg}, \bar{N_k}-\alpha_{kg}), k=1,...,K. [z, Z] = rjmcmcat(X, n0, I, priorr, da, t, seed) calculate the posterior probability for each tag falling in the DE (differential expressed) set. All input arguments have default values. But if you give the first argument X, you have to input arguments n0 and I. X:(sagedata), Gxn matrix for SAGE library data. n0:([3,3]), 1xK vector gives the number of libraries for each froup. I:(1000), Total iteration numbers. priorr:(0.1), prior probability ratio for a tag in S1. da:(2), the parameter for updating alpha. seed:(1234), random seed or state. -------------------------------------------------------------------------- Output arguments: z: Gx3 matrix showing: column 1, tag index column 2, posterior probability column 3, Bayes Factor. Z: (I+2)x(G+1) matrix showing: column 1, iteration number column 2--(G+1), the for G tags (1 column per tag) row 1, the tag index row 2, the group each tag falls in before iteration row 3--(I+2), the group each tag falls into after the corresponding iteration.Please note that each algorithm outputs an argument Z which records that in each step/iteration which tag moves from one group to the other one or keeps in the same groups. These results can be used for all remaining calculation and diagnostic tesing for the algorithms as descript in the manuscript.

>>simuK2 or >>rjmcmcot Or one could click "save and run" button if you open these files. >>[X, n0]=simuK2; will gives you the simulated data X and group information n0. If you have your pg: Gx2 matrix; no: 1x2 positive integer vector; and corresponding N; then >>[X, no]=simuK2(pg, n0, N); You need to change the seed for different simulated data set: >>X1 = simuK2(pg, n0, N, 123); >>X2 = simuK2(pg, n0, N, 456); Then X1 and X2 are different data sets simulated from the same distribution. >>[z, Z]=rjmcmcot; will give you the posterior probability as well as movement information for each iteration. For repeating simulated data results as in the manuscript: >>X1=load(datak2); >>n01=[3,3]; >>I1=80000; >>z1 = rjmcmcot(X1, n01, I1); >>X2=load(datak3); >>n02=[3,3,3]; >>z2 = rjmcmcot(X2, n02, I1); >>I2=1000; >>z3 = rjmcmcat(X1, n01, I2); >>z4 = rjmcmcat(X2, n02, I2); The differentially expressed tags can be identified in the file pg2 for K=2 and thefile pg3 for K=3 groups.

Zailong Wang Mathematical Biosciences Institute The Ohio State University 231 West 18th Avenue Columbus, OH 43210