Reversible Jump Markov Chain Monte Carlo (RJMCMC) for SAGE library data analysis
INSTRUCTIONS
The purpose of the program is to
identify differentially expressed tags (genes) under different conditions from SAGE library data.
This document contains instructions and examples for running reversible jump MCMC algorithm for SAGE library data analysis in Matlab 7.0 or above.
The program can be used to the similar situations as SAGE library data, i.e. counts or natural number data from Binomial or Poisson distribution.
Please refer to http://www.stat.ohio-state.edu/~statgen/DE-SAGE.html for reference paper and Matlab codes.
Download the data and codes
>From website http://www.stat.ohio-state.edu/~statgen/SOFTWARE/RJMCMC to download
example data sets and Matlab codes to your directory say
C:\downloaded\RJMCMC. Basically, you have downloaded three data sets and four main Matlab program codes (functions). Three data sets are dataK2
(simulated 2 group), dataK3 (simulated 3 group) and sagedata.txt (real data). Please see "Data set description" and "Data simulation" for the
details of these data set. Four main programs (functions) are: simuK2, simuK3, rjmcmcot, and rjmcmcat. For getting help for these functions,
type help('function-name') in the Matlab command window after you have changed the directory to where you downloded the codes. The help message
are also given in this instruction.
Data set description
The data set can be prepared by Excel or other format that can be imported to Matlab. An example of
of the format of data set called
"sagedata.txt" is described as follows. The sagedata.txt is a 596x6 matrix with 6 libraries and 596 tag counts for each library. This is the
real SAGE data we used in our manuscript. It is also the default data argument for Matlab commands: rjmcmcot and rjmcmcat. This data will be
separated into two groups: adult and aged group. The First three columns consist of the adult group and the last three columns consist of the
aged one. So the second parameter for these commands is n0=[3, 3] which is also set to be the default vector for argument n0. The tag index will
be automatically settled as 1:596 corresponding to the 596 rows.
Data simulation
One can simulate the SAGE data by using downloaded Matlab functions: simuK2 and simuK3, for K=2 groups and K=3 groups simulated data respectively
as in our manuscript. The following help message gives you details of function simuK2. It is a copy from the Matlab command window by type
>> help('simuK2')
>> help('simuK2')
Simulate K=2 SAGE data:
[X, n0] = simuK2(pg, n0, N, seed) simulate SAGE data set from intensity pg
generated from real data proportions or your own generated proportions.
All input parameters have default values:
pg=pg2, a two column matrix for which top 50 BW values for 2 group
are different proportion. Others are same.
n0=[3,3] means each group has 3 libraries.
N = SX2*10, a vector with the length length(N) = sum(n0), gives
the total counts for each library.
seed=234 for random seed or state.
SX2 is the vector containning the summation of each library counts for
real data sagedata.txt. pg2 and SX2 will be downloaded to the same
directory with the programs.
--------------------------------------------------------------------------
Output arguments:
X: simulated matrix with the same rows as pg and the same column as N,
in addition, sum(X, 1) = N.
n0: = input n0.
The help message for K=3 groups is as follows:
>> help('simuK3')
Simulate K=3 SAGE data:
[X, n0] = simuK3(pg, n0, N, seed) simulate SAGE data set from intensity pg
generated from real data proportions or your own generated proportions.
All input parameters have default values:
pg=pg3, a three column matrix for which 100 values for 3 group
are different proportion. Others are same.
n0=[3,3,3] means each group has 3 libraries.
N = [SX2*10, 30000, 30000, 30000], a vector with the length
length(N) = sum(n0), gives the total counts for each library.
seed=234 for random seed or state.
SX2 is the vector containning the summation of each library counts for
real data sagedata.txt. pg3 and SX2 will be downloaded to the same
directory with the programs.
--------------------------------------------------------------------------
Output arguments:
X: simulated matrix with the same rows as pg and the same column as N,
in addition, sum(X, 1)=N.
n0: = input n0.
Reversible Jump MCMC functions
There are two functions: rjmcmcot and rjmcmcat, corresponding to two algorithms OT and AT respectively. As described in our manuscript,
"OT" means pciking "One Tag" at each iteration to check its potential movement between deferentially expressed tag set (DE) and similarly
expressed tag set (SE), while "AT" means checking "All Tags" per iteration for their potential movement. The help messages for these two
functions give the detailed descriptions.
>> help('rjmcmcot')
RJMCMC --Reversible Jump Markov Chain Monte Carlo used for identifying
expressed genes for SAGE library data analysis.
The data is in X with Gxn matrix where G is the number of total
gene and n=sum(n0) is the number of individuals.
The data has K>=2 groups by individuals where K=length(n0).
n0 contains the library numbers for each group.
This version considers move ``one'' gene after updating parameters.
We assume p_{kg}~\beta(t\alpha_{kg}, t(\bar{N_k}-\alpha_{kg})),
k=1,...,K.
[z, Z] = rjmcmcot(X, n0, I, priorr, da, t, seed) calculate the
posterior probability for each tag falling in the DE (differential
expressed) set. All input arguments have default values. But if you give
the first argument X, you have to input arguments n0 and I.
X:(sagedata), Gxn matrix for SAGE library data.
n0:([3,3]), 1xK vector gives the number of libraries for each froup.
I:(80000), Total iteration numbers.
priorr:(0.1), prior probability ratio for a tag in S1.
da:(2), the parameter for updating alpha.
t:(1), prior sensitivity parameter.
seed:(1234), random seed or state.
--------------------------------------------------------------------------
Output arguments:
z: Gx3 matrix showing: column 1, tag index
column 2, posterior probability
column 3, Bayes Factor.
Z: Ix5 matrix showing: column 1, iteration number
column 2, the tag g being picked
column 3, the group from which g is picked
column 4, the group g is going to
column 5, adding(1) or deleting(-1) or no moving(0).
>> help('rjmcmcat')
RJMCMC --Reversible Jump Markov Chain Monte Carlo used for identifying
expressed genes for SAGE library data analysis.
The data is in X with Gxn matrix where G is the number of total
gene and n=sum(n0) is the number of individuals.
The data has K>=2 groups by individuals where K=length(n0).
n0 contains the library numbers for each group.
This version considers move ALL tages after updating parameters.
We assume p_{kg}~\beta(\alpha_{kg}, \bar{N_k}-\alpha_{kg}), k=1,...,K.
[z, Z] = rjmcmcat(X, n0, I, priorr, da, t, seed) calculate the
posterior probability for each tag falling in the DE (differential
expressed) set. All input arguments have default values. But if you give
the first argument X, you have to input arguments n0 and I.
X:(sagedata), Gxn matrix for SAGE library data.
n0:([3,3]), 1xK vector gives the number of libraries for each froup.
I:(1000), Total iteration numbers.
priorr:(0.1), prior probability ratio for a tag in S1.
da:(2), the parameter for updating alpha.
seed:(1234), random seed or state.
--------------------------------------------------------------------------
Output arguments:
z: Gx3 matrix showing: column 1, tag index
column 2, posterior probability
column 3, Bayes Factor.
Z: (I+2)x(G+1) matrix showing: column 1, iteration number
column 2--(G+1), the for G tags (1 column per tag)
row 1, the tag index
row 2, the group each tag falls in before iteration
row 3--(I+2), the group each tag falls into after
the corresponding iteration.
Please note that each algorithm outputs an argument Z which records that in each step/iteration which tag moves from one group to the other one
or keeps in the same groups. These results can be used for all remaining calculation and diagnostic tesing for the algorithms as descript in the
manuscript.
Examples
Run these functions in your directory by typing the name after the Matlab prompt in Matlab Command Window >>, i.e.
>>simuK2
or
>>rjmcmcot
Or one could click "save and run" button if you open these files.
>>[X, n0]=simuK2;
will
gives you the simulated data X and group information n0.
If you have your pg: Gx2 matrix; no: 1x2 positive integer vector; and corresponding N; then
>>[X, no]=simuK2(pg, n0, N);
You need to change the seed for different simulated data set:
>>X1 = simuK2(pg, n0, N, 123);
>>X2 = simuK2(pg, n0, N, 456);
Then X1 and X2 are different data sets simulated from the same distribution.
>>[z, Z]=rjmcmcot;
will give you the posterior probability as well as movement information for each iteration.
For repeating simulated data results as in the manuscript:
>>X1=load(datak2);
>>n01=[3,3];
>>I1=80000;
>>z1 = rjmcmcot(X1, n01, I1);
>>X2=load(datak3);
>>n02=[3,3,3];
>>z2 = rjmcmcot(X2, n02, I1);
>>I2=1000;
>>z3 = rjmcmcat(X1, n01, I2);
>>z4 = rjmcmcat(X2, n02, I2);
The differentially expressed tags can be identified in the file pg2 for K=2 and thefile pg3 for K=3 groups.
Contact information
For any questions about these algorithms and programs, please contact Zailong Wang by zlwang@mbi.osu.edu
Zailong Wang
Mathematical Biosciences Institute
The Ohio State University
231 West 18th Avenue
Columbus, OH 43210