BIOL 848: Phylogenetic Methods |
This lab was written by Paul O. Lewis and being used in BIOL 848 (with slight modifications) by Mark Holder with Paul's permission.
The goal of this lab exercise is to show you how to conduct maximum likelihood analyses in PAUP* using several models, and to decide among competing models using likelihood ratio testsTake a look at what was homework 6 the last time I taught the course. I'll assign you a similar homework, but with different characters, this year
Create a new file in PAUP* and enter the following text (note this is not exactly the same dataset as homework, so you'll need to modify this when you want to double check your homework).
#nexus begin paup; set storebrlens; end; begin data; dimensions ntax=4 nchar=2; format datatype=dna; matrix taxon1 AA taxon2 AC taxon3 CG taxon4 TT ; end; begin trees; utree anyTreeNameHere = (taxon1:0.1, taxon2:0.1, (taxon3:0.1, taxon4:0.1):0.3); end; begin paup; lset nst=1 basefreq=equal; lscores 1 / userbrlen sitelike; end;
The NEXUS file you just created has four blocks.
The first block is a paup block that sets the storebrlens flag. This tells PAUP* to save branch lengths found in any trees. By default, PAUP* throws away branch lengths that it finds and estimates the branch lengths. In this case, we are trying to get PAUP* to compute likelihoods for a tree in which all five branch lengths are set to the specific value 0.1, so it is important to get PAUP* to not discard branch lengths.
The second block is the data block. Data for two sites are provided, the first site is a data pattern that you would like to score. The second site is necessary because PAUP* will refuse to calculate the likelihood of a tree with data from only one site (and PAUP also wants a data matrix that has all four nucleotides). We will simply ignore results for the second (dummy) site.
The third block is a trees block that defines the tree and branch lengths.
The keyword utree can be used in PAUP* (but not necessarily other programs) to explicitly define an unrooted tree. The anyTreeNameHere part is just an arbitrary name for this tree: you could use any name there.
The fourth (paup) block comprises an lset command that specifies the likelihood settings. The nst option specifies the number of substitution parameters, which is 1 for the JC model, and basefreq=equal specifies that base frequencies are assumed to be equal. Together, nst=1 and basefreq=equal specify the JC model because the only other model with one substitution parameter is the F81 model (which has unequal base frequencies).
The command lscores 1 tells PAUP* to compute likelihood scores for the first tree in memory (which is the one we entered in this file). The keyword userbrlen tells PAUP* to use the branch lengths in the tree description (i.e. don't estimate branch lengths), and the sitelike keyword tells PAUP* to output the individual site likelihoods (the default behavior is to just output the overall likelihood).
Ok, go ahead and execute the file in PAUP*. Make sure that you understand the output. Can you think of a way to get PAUP* to give you the likelihood for a particular site and ancestral character state reconstruction?
In part A, we told PAUP* to use user-defined branch lengths and output site likelihoods whenever the lscores command was issued. PAUP* remembers these settings, and sometimes this causes unexpected results. You can cause PAUP* to forget these changes to default settings in one of two ways:
Issue the factory command now to cause PAUP* to revert to its factory default settings without having to quit and restart the program.
Download (again) the data file algae.nex. You may remember that we found last week that only one model (LogDet) gave us the accepted phylogeny for these data using various distance-based approaches (i.e. that the chlorophyll-a/b-containing plastids group together, excluding the cyanobacterium Anacystis and the chromophyte chlorophyll-a/c-containing Olithodiscus). This week we will see if we can tease apart which aspects of sequence evolution that are important for getting the tree correct.
The first goal is to learn how to obtain maximum likelihood estimates of the parameters in several different substitution models. Use PAUP* to answer the following questions. Start by obtaining the maximum likelihood tree under the F81 model. Create a run.nex file and save in it the following:
#nexus begin paup; execute algae.nex; set criterion=likelihood; lset nst=1 basefreq=empirical; hsearch; end;
The nst=1 tells PAUP* that we want a model having just one substitution rate parameter (the JC69 and F81 models both fall in this category). The basefreq=empirical tells PAUP* that we want to use simple estimates of the base frequencies. The empirical frequency of the base G, for example, is the value you would get if you simply counted up all the Gs in your entire data matrix and divided by the total number of nucleotides. The empirical frequencies are not usually the same as the maximum likelihood estimates (MLEs) of the base frequencies, but they are quick to calculate and often very close to the corresponding MLEs.
Execute run.nex in PAUP* and issue the following command to show the tree:
showtrees;
One problem is that the tree drawn in such a way that it appears to be rooted within the flowering plants (tobacco and rice). Specifying the cyanobacterium Anacystis as the outgroup makes more sense:
outgroup Anacystis_nidulans; showtrees;
Note that the branches are not drawn proportional to the expected number of substitutions. To fix this, use the describetrees command rather than the simpler showtrees command:
descr 1 / plot=phylogram;
As with all PAUP* commands, it is usually not necessary to type the entire command name, only enough letters that PAUP* can determined unambiguously which command you want. Here, you typed descr instead of describetrees, and it worked just fine.
Note that you will work with this tree for quite awhile. Resist the temptation to do heuristic searches with each model, as it will be important to compare the performance of all of the models on the same tree topology! To be safe, save this tree to a file named f81.tre using the savetrees command:
savetrees file=f81.tre brlens;
If you ever need to read this tree back in, use the gettrees command:
gettrees file=f81.tre;
Now get PAUP* to show you the maximum likelihood estimates for the parameters of the F81 model used in this analysis (the 1 here refers to tree 1 in memory):
lscores 1;
Now estimate the base frequencies on this tree with maximum likelihood as follows. Note how the lscores command is used to force PAUP* to recompute the likelihood (under the revised model) and spit out the parameter estimates.
lset basefreq=estimate; lscores 1;
Switch to the HKY85 model now and estimate the transition/transversion ratio along with the base frequencies. The way you specify the HKY model in PAUP* is to tell it you want a model with 2 substitution rate parameters (nst=2), and that you want to estimate the base frequencies (basefreq=estimate) and the transition/transversion ratio (tratio=estimated). Note that these specifications also apply to the F84 model, so if you want PAUP* to use the F84 model, you would need to add variant=f84 to the lset command.
lset nst=2 basefreq=estimate tratio=estimate; lscores 1;
Now ask PAUP* to estimate pinvar, the proportion of invariant sites, using the command lset pinvar=estimate. The HKY85 model with among-site rate heterogeneity modeled using the two-category invariant sites approach is called the HKY85+I model.
Now set pinvar=0 and tell PAUP* to use the discrete gamma distribution with 5 rate categories. Here are the commands for doing this all in one step:
lset pinvar=0 rates=gamma ncat=5 shape=estimate; lscores 1;
The HKY85 model with among-site rate heterogeneity modeled using the discrete gamma approach is called the HKY85+G model.
Now issue the command lset pinvar=estimate to create the HKY85+I+G model and ask PAUP* to estimate both pinvar and the gamma shape parameter simultaneously.
In this section, you will perform some simple likelihood ratio tests to decide which of the models used in the previous section does the best job of explaining the data while keeping the number of parameters used to a minimum. Remember, the likelihood ratio test is performed by doubling the difference in log-likelihood scores and comparing this test statistic with the critical value from a chi-squared distribution having degrees of freedom equal to the difference in the number of estimated parameters in the two models. Use this online chi-square calculator to determine the significance of the test.
The model with which we will begin is the F81 model with estimated base freqencies. Compare this F81 model to the HKY85 model, which differs from the F81 model only in the fact that it allows transitions and transversions to occur at different rates.
You should have all the numbers you need to perform these likelihood ratio tests. If, however, you have not written some of them down, and thus need to redo some of these analyses, you might need to know how to "turn off" rate heterogeneity using the following command:
lset rates=equal pinvar=0;
Consider the remaining models for which we have collected log likelihoods: HKY85+I, HKY85+G and HKY85+I+G.
Using the simplest model that you can defend (of the four we have examined), perform an heuristic search under the maximum likelihood criterion. To make the analysis go faster, we will ask PAUP* to not re-estimate all the model parameters for every tree it examines during the search. To do this, first use the lset command to set up the model you are planning to use. Use the lscores command to force PAUP* to re-estimate all of the parameters of your selected model on some tree (the tree just needs to be something reasonable, such as a NJ tree or the F81 tree you have been using). Now, for every parameter that you estimated, change the word estimate to previous in the lset command, and after executing this new lset command, start a search using just hsearch. PAUP* will fix the parameters at the previous values (i.e. the estimates you just forced it to calculate) and use these same values for every tree examined during the search.
This lab is a bit long, so we will not take time to do it now, but I hope you realize that you could figure out exactly what parameter(s) are needed in the model to get this tree right. JC69 doesn't do it, nor does F81 (as you may have noticed), but it actually doesn't take much beyond JC69 to do the trick.
The data file simdata.nex was generated under one of the following models: JC69, F81, K80, or HKY85.
All of the sites either evolved at the same rate, or rate heterogeneity was added in the form of gamma distributed relative rates with or without some invariant sites. Can you identify which of the four basic models was used, and in addition tell me how much rate heterogeneity was added?
Hint: start by getting a NJ tree and estimating all parameters of the most complex model (HKY85+I+G) on that tree. You should be able to tell by examining the parameter estimates which model was used. If it is not clear, you can perform some likelihood ratio tests to see how much significance should be attached to the value of the parameters in question.
To get the feeling for how many times you have to calculate the likelihood during the process of maximizing the likelihood I encourage you to use the command lset logiter in PAUP before performing an LScore command. First get a tree in memory, and specify the GTR+I+G model so that there are plenty of parameters to estimate.
When it is conducting the LScore, each line that starts with the word "parameters" corresponds to a new set of parameters in the model of character evolution. Each line that starts with "pass" is a pass over the tree that optimizes the length of each branch. The optimal value for a single branch length is found, and then PAUP moves to a neighboring branch. Because the lengths of one branch affect the maximum likelihood estimates of other branches (particularly branches that are close to it), PAUP has to pass over the entire tree multiple times before it converges to a set of branch lengths that do not change. If you score a large tree, then there can be a huge number of passes over the branch lengths mixed in with changes to the substitution parameters.
Approximating the ML score really does require a lot of computation.