Species Tree Estimation Tutorial for Phylogenomics Workshop 2019

In this tutorial, we will explore species tree inference, primarily using the SVDQuartets method. The theory behind SVDQuartets was developed in collaboration with Julia Chifman. Dave Swofford has implemented the method in PAUP*, and in doing so, has contributed greatly to its continued development in all aspects. Much of the material in this tutorial was developed by Dave in some form.

All of the files referenced in this tutorial are available at http://www.stat.osu.edu/~lkubatko/phylogenomics2019/.

Exercise 1: Analysis of the rattesnake data

The data set we will use for this exercise is the rattlesnake data of Kubatko et al. (Syst. Biol. 60(4): 393-409, 2011) that was used earlier today in the lecture. The data consist of 2 species, each divided into 3 subspecies, and an outgroup. There are 26 individuals (52 sequences) and 19 genes, for a total of 8,466 sites. The data can be downloaded in nexus format from www.stat.osu.edu/~lkubatko/phylogenomics2019/data-snakes.nex, by typing the following at the Unix prompt (inside whatever directory you'd like):

$ wget http://www.stat.osu.edu/~lkubatko/phylogenomics2019/data-snakes.nex
Once you have downloaded the data, open the nexus file and look at the contents. The first part of the file contains the sequence data, in the usual nexus format. Below that, you will see some information that is specific to the multilocus species tree framework. We will discuss this as a group.

Note that although the input format involves concatenating loci, SVDQuartets is NOT a concatenation method! In fact, SVDQuartets assumes that each site in the alignment has its own underlying gene tree, generated under the coalescent model from the species tree. Do not confuse the data format with the underlying model!

Start PAUP* by issuing the "paup" command at the unix command line:

$ paup

You will see that the prompt in your unix window has changed. Now load the data by executing the data-snakes.nex data file via the following command:

paup> exe data-snakes.nex;

We will analyze these data using SVDQuartets (or just SVDQ). First, let's look at a list of possible options, by using PAUP*'s help system. At the PAUP* command line, type:

paup> svdq ?

We'll begin by running a basic SVDQ analysis, using the following command;

paup> svdq partition=snakespecies;

Note that this analysis evaluates all possible quartets, and outputs the species tree inferred by SVDQuartets. Just above the species tree, some information regarding the analysis is given, such as the proportion of inferred quartets that are compatible with the species tree. This analysis should have run quickly.

Let's get some intuition about what SVDQ is doing. We'll repeat the analysis we just did, but this time we'll take advantage of the option to output quartet scores:

paup> svdq partition=snakespecies showScores=yes;

What you see in the output are the scores for all three possible quartets for each of the 68,465 possible sets of four taxa for this data. Recall from the lecture that the smaller the score, the more supported that quartet is, compared to the other two possibilities. Scan through the sets of the quartet scores, and notice how some sets of four taxa provide stronger evidence for a particular quartet relationship than others.

Next, we'll run a bootstrap analysis. The command for running the analysis with 100 bootstrap replicates and exhuastive quartet sampling is given below. You can change the random number seed to whatever you like. If you are running on your own computer, and it is slow, you may wish to reduce the number of bootstrap replicates and/or the number of quartets evaluated on each replicate for the sake of this exercise.

paup> svdq partition=snakespecies showScores=no seed=1234568 bootstrap nreps=100;
Questions:

Now we'll consider constructing a lineage tree, a tree relating the individual sequences. Although this tree can be difficult to interpret, it may be helpful in exploring your data, finding mis-identified sequences, etc.

paup> svdq partition=none seed=475839 bootstrap nreps=100;

Exercise 2: Exploring the anomaly zone

During this morning's lecture, we discussed the anomaly zone, which refers to a set of species trees for which the gene tree with the highest probability under the coalescent model does not agree in topology with the species tree. Liu and Edwards (2009) studied the behavior of various species tree reconstruction methods for data generated from a species tree in the anomaly zone. Here we will replicate part of their study.

Look carefully at the tree in Figure 2(a) of their paper (available here). Now consider the following PAUP* code, which is also available here (you can issue the command "wget http://www.stat.osu.edu/~lkubatko/phylogenomics2019/anomaly_zone.nex" from the unix prompt to get it).


#NEXUS

begin taxa;
    dimensions ntax=5;
    taxlabels A B C D E;
end;

begin trees;
    tree 1 = [&R] ((((A:0.05,B:0.05):0.005,C:0.055):0.005,D:0.06):1.0,E:1.06);
end;

begin paup;
    showtrees/userbrlens;
    end;

begin dnasim;
    simdata multilocus=y nloci=5000 nsitesperlocus=500;
    truetree source=memory treenum=1  units=2Ngen
             scalebrlen=10.0
             mscoal=y Ne=10000000 mu=0.25e-8
             showtruetree=brlens showgenetrees=n storetruetrees=n
             seed=1;
    lset model=jc nst=1 basefreq=eq;    [= Jukes-Cantor model]
    beginsim nreps=1 seed=22;
        outgroup 5;
        rootTrees rootMethod=outgroup;
        svdq nthreads=2;
        tally 'svdq';
        showtrees;
        set criterion=likelihood;
        hsearch;
        tally 'hsearch';
        showtrees;
    endsim;
end;

We'll discuss the various options in this file as a group. Then, you can go ahead and run the analysis. Open PAUP* and use the command:

paup> exe anomaly_zone.nex

Look at the output and compare the estimated trees using SVDQuartets and maximum likelihood to the tree in Figure 2(a) of Liu and Edwards (2009).

Optional: Once you understand what the code above is doing, we can think of replicating some of the results in Figure 3 of Liu and Edwards (2009). Change the number of reps to something larger (but beware: if you run too many reps, it will take a long time to run!). You will also want to remove the "showTrees" commands to save time. You can also try out some other methods of tree inference, such as neighbor-joining and parsimony. Code for all of this is available in the file www.stat.osu.edu/~lkubatko/phylogenomics2019/anomaly_zone_sim.nex.

Exercise 3: Class-level simulation study

In this exercise, we will explore the performance of SVDQuartets. We'll consider three factors: the number of loci, the length of each locus, and the scaling factor for the branch lengths. We'll consider a 6-taxon model tree, as given in the nexus file www.stat.osu.edu/~lkubatko/phylogenomics2019/sim-6tax.nex. When you look at the file, you will notice that the character @ appears in the three places corresponding to the three values we'll change. You will need to edit your file as follows:

We will consider a varying number of loci (200, 500, 1000, 3000, 5000), and we will consider three values for the branch length scaling factor: 0.1, 1.0, and 10.0. You can run all of the different locus values in one run, as shown below. The file below is for 200 sites per locus (since I'm an "L") with the banch length scaling factor set to 1.0:

#NEXUS

begin taxa;
    dimensions ntax=6;
    taxlabels A B C D E F;
end;

begin trees;
    tree 1 = [&R] (A:4.2,((B:1.0,C:1.0):2.2,(D:3.1,(E:3.0,F:3.0):0.1):0.1):1.0);
end;

begin paup;
    showtrees/userbrlens;
end;

begin dnasim;
    simdata multilocus=y nloci=(200 500 1000 3000 5000) nsitesperlocus=200;
    truetree source=memory treenum=1  units=2Ngen
             scalebrlen=1.0
             mscoal=y Ne=100000 mu=1e-8
             showtruetree=brlens showgenetrees=n storetruetrees=n
             seed=1;
    lset model=jc nst=1 basefreq=eq;    [= Jukes-Cantor model]
    beginsim nreps=100 seed=0;
        svdq nthreads=2;
        tally 'svdq';
    endsim;
end;

When you are done editing the file and are ready to run the analysis, use the following command:

paup> exe sim-6tax.nex

You will each do three separate runs, one with each choice of branch length scaling factor. When each run finishes, enter its results in this Google spreadsheet: https://docs.google.com/spreadsheets/d/1bHYJZ-xzYFh8JeWWceELhRczKQTNHbLA5rJmHrjZJiY/edit?usp=sharing . The graphs will dynamically update (hopefully!), so you can watch the progress made by the class while you wait for your runs to finish (you may want to copy your results somewhere and get the next run started before you enter the data into the spreadsheet). We will come together as a group to discuss the results if there is time.

References

Chifman J., Kubatko L.S. 2014. Quartet inference from SNP data under the coalescent model. 30:3317–3324.

Chifman J., Kubatko L.S. 2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J Theor Biol. 374:35–47.

Kubatko L.S., Degnan J.H. 2007. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 56:17–24.

Kubatko, L., H.L. Gibbs, and E.W. Bloomquist. 2011. Inferring species-level phylogenies and taxonomic distinctiveness using multi-locus data in Sistrurus rattlesnakes, Systematic Biology,60(4): 393-409.

Liu L., Edwards S.V. 2009. Phylogenetic analysis in the anomaly zone. Syst Biol. 58:452–460.

Roch, S. and M. Steel. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theoret. Pop. Biol. 100:56-62.