Heritable-Clustering: Tumor progression pathway recapitulation, using DNA methylation data, clusters tumor patients, finds node centers and scores, and builds tumor progression pathways. ================================================================= Instructions ================================================================= This document contains instructions and examples for running the heritable clustering algorithm in Matlab 7.0 or above. Please refer to http://www.stat.ohio-state.edu/~statgen/SOFTWARE/Pathway/ for reference paper and Matlab codes. --------------------------- Download the data and codes --------------------------- From the website http://www.stat.ohio-state.edu/~statgen/SOFTWARE/Pathway, click on the "Software" link to download example data sets and Matlab codes into your directory, e.g., C:\Pathway. After the package has been downloaded and decompressed, open the Matlab command window and open the maincode.m file within the Matlab environment. Change the current directory to where you downloaded the codes, e.g., on Windows: >> cd('C:\Pathway') on Macs: >> cd('/Users/local/Pathway'). The program maincode.m can now be run by Matlab. As an alternative, copy the comands for maincode.m and past them to a Matlab command window. In this manner you will not have to change directories. -------------------- Data set description -------------------- The input data set can be saved as any format that can be imported into Matlab. An example of the expected layout of the data is given by in the file "BreastWork55.txt" and is described as follows: Columns 1-5 are phenotype measurements (p=5) and the other 9 columns are epigenotypic measurements (methylation data) for each patient. The input file name is hard-coded into the program; if your input file is given by a name other than "BreastWork55.txt", you must correct the name in the first row of the main code: >>X=load('BreastWork55.txt'); p=5; ------------------------------ Find the weight and similarity ------------------------------ The second line in maincode.m runs the command "wepsilon" which is used to find the best weight and similarity. As this information is necessary for constructing an optimal Heritable cluster, it is is saved to the file'table 1'. >>results=wepsilon(X,p); >>save table1 results -ascii Note that this is a computationally expensive operation and may take a considerable amount of time to complete. ------------------------------------- Clustering and updating clusters ------------------------------------- The fifth and sixth line call the command "HeritCluster" that will give you the cluster: C (in cell format), the result for the f value, and the similarity between any two individuals: Svector for given w and epsilon. >>w=0.7; epsilon=0.7; >>[C,result,Svector]=HeriCluster(X,p,w, epsilon); The commands "indexclus" and "clusindex" (see lines 9 and 23 respectively) are used to transfer the format of clusters between cell format and vector format. For example: >> load('C') % C is in the cell format >> C C = Columns 1 through 4 [1x26 double] [1x8 double] [1x11 double] [1x4 double] Columns 5 through 8 [1x6 double] [1x5 double] [1x7 double] [1x3 double] Columns 9 through 13 [1x5 double] [1x2 double] [67] [1x4 double] [1x2 double] Columns 14 through 15 [84] [86] % Transfer to vector format: >> idxc=indexclus(C) idxc = 1 3 2 4 3 5 4 1 5 3 6 3 7 3 8 3 9 1 10 1 11 1 12 1 13 1 14 4 15 1 16 1 17 1 18 6 19 3 20 7 21 1 22 1 23 2 24 7 25 6 26 5 27 5 28 1 29 1 30 2 31 8 32 2 33 1 34 9 35 9 36 5 37 3 38 1 39 5 40 7 41 1 42 1 43 6 44 2 45 9 46 8 47 10 48 1 49 6 50 1 51 1 52 3 53 3 54 3 55 4 56 7 57 3 58 7 59 1 60 1 61 1 62 1 63 1 64 9 65 4 66 7 67 11 68 9 69 2 70 5 71 7 72 1 73 2 74 2 75 8 76 10 77 6 78 12 79 12 80 13 81 12 82 12 83 13 84 14 85 2 86 15 % Transfer back to cell format: >>C1=clusindex(idxc) %Then C1 is the same as C. For updating clusters, one can use commands "clusupsimi" and "clusuplh" for similarity method and likelihood method respectively. Both give the updated clusters in vector format and in cell format. >>[idxcsi, Csi]=clusupsimi(C, Svector); >>[idxclhde, Clhde]=clusuplh(Y, idxc); Note 1: The command "kmeans02" is the revised version of kmeans so it can deal with the distance defined in our manuscript (i.e. weighted between phenotype and epigenotype distance). >>idxkm1=kmeans02(X,k,'dist','weightedpg','EmptyAction','singleton'); Note 2: For PAM and Hierarchical Clustering method, we use the function in R package. If you are working with a file other than the supplied "BreastWork55.txt" data set, you must run the script "Rcodes.R" using R package with appropriate changes in file and directory names. The silhouette values and Hierarchical Clusters outputed by "Rcodes.R" are used only as a means of comparing the results of the Heritable cluster algorithm and are not necessary for the implimentation of the program. The commands "likelic" and "entropyc" are used to calculate total likelihood and entropy for one cluster result respectively: >>LKC=-log(likelihc(Y,C)); >>NEC=entropyc(X,C); Note 3: Data set Y is calculate from X considering the dependence between ER/PR and histology as well as between Clinical stage and Metastasis status. ---------------------- Pathway Recapitulation ---------------------- The command "pathway" will provide you all information for building a pathway: current node; child node; genotype node center; and phenotype node center and score. For example: >>p=5; hepsi = 1; [cur, chi, NC, Pheno] = pathway(X,p,idxc,hepsi); nn=length(cur); for i=1:nn cur{i} end chi ans = %First generation 1 11 ans = %Second generation 6 7 3 4 14 8 9 12 ans = %Third generation 5 2 15 ans = %Fourth generation 13 10 chi = 1 1 6 % The node 6 is the progeny of generation 1 first node, i.e. node 1. 2 1 5 % The node 5 is the progeny of generation 2 first node, i.e. node 6. 1 1 7 % The node 7 is the progeny of generation 1 first node, i.e. node 1. 1 1 3 % The node 3 is the progeny of generation 1 first node, i.e. node 1. 1 1 4 % The node 4 is the progeny of generation 1 first node, i.e. node 1. 1 1 14 % The node 14 is the progeny of generation 1 first node, i.e. node 1. 1 1 8 % The node 8 is the progeny of generation 1 first node, i.e. node 1. 1 1 9 % The node 9 is the progeny of generation 1 first node, i.e. node 1. 3 1 13 % The node 13 is the progeny of generation 3 first node, i.e. node 5. 3 1 10 % The node 10 is the progeny of generation 3 first node, i.e. node 5. 2 4 2 % The node 2 is the progeny of generation 2 fourth node, i.e. node 4. 2 3 15 % The node 15 is the progeny of generation 2 third node, i.e. node 3. 1 1 12 % The node 12 is the progeny of generation 1 first node, i.e. node 1. ------------- Plot the tree ------------- The following commands are used to plot the trees in the manuscript and in supplementary materials: plotree_si; plotree_lh; plotree_km; plotree_pam; plotree_hcut; Then connect all nodes with arrow lines according to the information in the result of childnode from the corresponding "pathway" command. ----------------------------------- Missing data with likelihood method ----------------------------------- For missing data, the likelihood method can be used to update cluster. The command "clusuplhm" is created for this mission. The command "pathwaym" is used to recapitulate pathway with missing data. For example: [idxclhde, Clhde]=clusuplhm(Y, idxc, nmiss, 200); [curlhdem, chilhdem, NClhdem, Phenolhdem] = pathwaym(XT,p,idxclhdem,hepsi); Finally, "plotree_lhm" will build the tree from the results dealing with missing data.