BIOL 848: Phylogenetic Methods

This lab was written by Paul O. Lewis and being used in BIOL 848 (with slight modifications) by Mark Holder with Paul's permission (the original lab is here).

Lab 3: Parsimony and distance searching using PAUP.

Contents

Part A: Searching under the parsimony criterion

  1. Create the data file. Download the file from http://phylo.bio.ku.edu/slides/data/angio35b.nex and save it on the machine that you are working on.
  2. Create a command file. Create a blank file, then type in the following text, saving this as run.nex:
    #NEXUS
    begin paup;
        log file=output.txt start append;
        execute angio35b.nex;
    end;
    

    in the same directory that holds the angio35b.nex file that you downloaded
  3. Execute run.nex, which will in turn execute angio35b.nex. There at least a couple of advantages to creating little NEXUS files like run.nex. For now, the only advantage is that executing run.nex automatically starts a log file so that you will have a record of what you did. Later, when you get in the habit of putting commands in paup blocks, you will appreciate the separation of the data from the commands that initiate analyses (I have many times opened a data file, forgetting about the embedded paup block that then starts a long search, overwrites my previous log file, and otherwise creates havoc).
  4. Delete all taxa except the first five. Using this command delete 6-. will cause PAUP* to ignore all taxa except Ephedrasinica, Gnetum_gnemJS, WelwitschiaJS, Ginkgo_biloba, and Pinus_ellCH.
  5. Perform an exhaustive search using parsimony. Use the alltrees fd=bar command for this. This should go fast because you now have only 5 taxa.
    • How many separate tree topologies did PAUP* examine?
    • What is the parsimony treelength of the best tree? The worst tree?
    • How many steps separate the best tree from the next best?
  6. Perform an heuristic search using NNI branch swapping. Before you start, use the describe command to show you the tree obtained from the exhaustive enumeration.
    • Draw this tree on a piece of paper and then draw the 4 possible NNI rearrangements
  7. Find all NNI rearrangements of the best tree. Note that because we performed an exhaustive enumeration, we now know which tree is the globally most parsimonious tree. We are thus guaranteed to never find a better tree were we to start an heuristic search with this tree. Let's do an experiment: perform an NNI heuristic search, starting with the best tree, and have PAUP* save all the trees it encounters in this search. In the end, PAUP* will have in memory 5 trees: the starting tree and the 4 trees corresponding to all possible NNI rearrangements of that starting tree: hsearch start=1 swap=nni nbest=15;
    • start=1 starts the search from the tree currently in memory (i.e., the best tree resulting from your exhaustive search using the parsimony criterion)
    • swap=nni causes the Nearest-Neighbor Interchange (NNI) method to be used for branch swapping
    • nbest=15 saves the 15 best trees found during the search. Thus, were PAUP* to examine every possible tree, we would end up saving all of them in memory. The reason this command is needed is that PAUP* ordinarily does not save trees that are worse than the best one it has seen thus far. Here, we are interested in seeing the trees that are examined during the course of the search, even if they are not as good as the starting tree.
  8. Show all 5 trees in memory. Use the describe all command to plot the 5 trees currently in memory. The reason we are using the describe command rather than the showtrees command is because we want PAUP* to show us the numbers it has assigned to the internal nodes, something that showtrees doesn't do.
    • Which tree was the original tree?
    • Which trees correspond to NNI rearrangments of which internal edges on the original tree?
  9. Find the most parsimonious tree for all 35 taxa. Quit Paup. Relaunch it and reexecute run.nex to restor all taxa, then conduct a heuristic search having the following characteristics:
    • The starting trees are each generated by the stepwise addition method, using random addition of sequences
    • Swap using NNI branch swapping
    • Reset the nbest option to all because we want to be saving just the best trees, not suboptimal trees (yes, this option is a little confusing).
    • Set the random number seed to 5555 (this determines the sequence of pseudorandom numbers used for the random additions; ordinarily you would not need to set the random number seed, but we will do this here to ensure that we all get the same results)
    • Do 500 replicate searches; each replicate represents an independent search starting from a different random-addition tree
      Here is the full command implementing this search: hsearch start=stepwise addseq=random swap=nni nbest=all rseed=5555 nreps=500;
    • How many tree islands were found?
    • How long did the search take?
    • How many rearrangements were tried?
  10. Conduct a second search with SPR swapping. Be sure to reset the random number seed to 5555. You should be able to figure out how to do this using the output from hsearch ?. Note that to save typing you can call up previously entered commands using the little buttons on the right of the command line edit control.
    • How many tree islands were found?
    • What are the scores of the trees in each island?
    • How long did the search take this time?
    • How many rearrangements were tried?
  11. Now conduct a third search with TBR swapping.
    • How many tree islands were found?
    • What are the scores of the trees in each island?
    • How long did the search take this time?
    • How many rearrangements were tried?
    • How many trees are currently in memory (use the treeinfo command)?
    • Has PAUP* saved trees from all islands discovered during this search? (Hint: compare "Number of trees retained" to the sum of the "Size" column in the Tree-island profile.) Explain why PAUP* saved the number of trees it did.

      Wondering about that warning "Multiple hits on islands of unsaved trees may in fact represent different islands"? When PAUP* encounters a new island, it will find all trees composing that particular island in the process of branch swapping. Thus, if (in a new search) it encounters any trees already stored in memory, it knows that it has hit an island that it found previously. Note that it would be pointless to continue on this tack, because it will only find all the trees on that island again. For trees retained in memory, PAUP* can keep track of which island they belong to (remember that it is possible for trees with the same parsimony score to be in different tree islands!). But for trees that are not retained in memory, PAUP* only knows that it has encountered an island of trees having score X; it has no way of finding out how many islands are actually represented amongst the trees having score X.
    • Of the three types of branch swapping (NNI, SPR, TBR), which is the most thorough? Least thorough?
    • How many TBR rearrangements can PAUP* examine on the computer you are using in one second (on average)?
    • Based on this, how long would it take to examine all possible trees if only 20 sequences were included?
Feel free to look at the NEXUS datasets that cause multiple NNI islands (here) and the cause star decomposition to fail even in the absence of homoplasy (here).

Part A: Basic distance analyses in PAUP

Modify your PAUP* command file

Use a text editor to edit the run.nex in the same directory as the angio35b.nex file. This new file will be a NEXUS file that contains the PAUP block with the commands for PAUP. Here are the commands:

#NEXUS

begin paup;
  execute angio35b.nex;
  dset dist=abs;
  delete 5-.;
  exclude missambig;  
  showdist;
end;

Type (or copy these commands in the file and save it as plain text -- not RTF).

The file run.nex tells PAUP* to:

Save the run.nex file. Then execute it PAUP*, and examine the output.

Calculate p distances and JC distances

To see the distances as a proportion of sites that differ for the sequences just change the distance measure from abs to p and re-execute run.nex.

Examine the resulting data matrix, change the distance correction in run.nex from p to jc to tell PAUP to use the Jukes-Cantor distance. Re-execute the file.

Execute the dset ? command to see all of the distance settings that are available in PAUP. If you are confused by an option, you can check it out by downloading the PAUP manual from here (or by asking me about an option).

Estimate edge lengths using weighted least squares

Next, perform a search using weighted least squares (weighted usually implies that the power is 2 in the denominator of the sum-of-squares formula). Add the following line to your run.nex file, just below the execute angiob35.nex; line:

set criterion=distance ;

This command tells PAUP* to use the distance optimality criterion specified by the objective option of the dset command during the search. If you were to leave this out, PAUP* would use the default optimality criterion (parsimony). Now issue the command dset ? in PAUP* to get a listing of the current values of all distance settings.

If your answer is not 2, then add this line to your paup block, below the execute command:

dset power=2;

If your answer is not lsfit, then add this line to your paup block, also below the execute command:

dset objective=lsfit;

Finally, add the following two lines to the end of your paup block to perform the search and show the resulting best tree, including the estimated branch lengths:

alltrees;
describe 1 / brlens;

Save and re-execute the file.

Searching under the minimum evolution criterion

Before moving on to the next exercise, repeat the above search using the minimum evolution (or ME) criterion. To do this, you will need to add the objective=ME option to your dset command (be sure to remove the previous dset objective=lsfit if you had to put it in) and re-execute run.nex.

Part B: Analysis of algae.nex

Create the data file algae.nex

Click http://phylo.bio.ku.edu/slides/data/algae.nex to get the data file. Save it as algae.nex in a new folder on your local hard drive. These data were originally used in a 1994 study by Lockhart et al.[1] and comprise eight 16S ribosomal RNA sequences:

Anacystis a cyanobacterium (has chlorophyll a but not b or c)
Olithodiscus a chloroplast from a chromophyte alga (chlorophylls a and c)
Euglena a chloroplast from a photosynthetic euglenophyte protist (chlorophylls a and b)
Chlorella a chloroplast from a a chlorophyte green alga (chlorophylls a and b)
Chlamydomonas a chloroplast from a chlorophyte green alga (chlorophylls a and b)
Marchantia a chloroplast from a thallose liverwort (chlorophylls a and b)
Oryza a chloroplast from a monocot (chlorophylls a and b)
Nicotiana a chloroplast from a dicot (chlorophylls a and b)

All of these organisms except Anacystis and Olithodiscus have chlorophylls a and b. Based on independent evidence, it is probable that all chlorophyll a/b-containing chloroplasts have a common endosymbiotic origin, so we would expect trees constructed from these data to show a branch separating Anacystis and Olithodiscus from everything else. The cyanobacterium Anacystis uses phycobilin accessory pigments rather than chlorophylls for photosynthesis, and the chromophyte alga Olithodiscus has chlorophylls a and c (but not b).

Create a PAUP* command file.

As you did above create a new run.nex file to serve as a PAUP command file. Copy the following text and paste it into the new text file window, then save this file as runalgae.nex, placing it in the same directory as algae.nex:
#nexus

begin paup;
  log file=lab3.txt start replace;
  set torder=right;
  execute algae.nex;
  outgroup Anacystis_nidulans;
  set criterion=distance;
  
  [!
  ****************************
  ** JC Model, ME Criterion **
  ****************************
  ]
  dset distance=jc objective=me;
  alltrees;
  describe 1;
  
  [other commands here]
  log stop;
end;

During the course of this lab, you can simply add commands to this paup block rather than creating a paup block in the data file itself. The set torder=right command simply causes trees to be displayed so that the outgroup is at the top and the tree appears to flow to the right (this is often called ladderizing right). Try changing this to set torder=left to see what ladderizing left looks like.

Perform an exhaustive search under the minimum evolution and least squares criteria

Use both the JC and logdet models combined with both Minimum Evolution and Least Squares (using the default power of 2) optimality criteria. Use dset dist=? to find out how to specify the model, and use dset objective=? to find out how to specify which of the two distance-based optimality criteria to use. Set up all 4 analyses (2 models times 2 optimality criteria) in the paup block (the first one has already been done for you), then run them all at once by executing the file. Note the printable comment (it starts with an exclamation point). Making comments like this in your paup block will allow you to easily find where the results from each model start in the output.

The main point here is that the model used can make a difference in whether you get the correct answer. The logdet model has not yet been discussed: a bit more background on models is needed before an explanation of the logdet model will make sense. For now, just know that the logdet model is one of the few models in common use that is not tricked by nucleotide composition that changes across the tree. Models like JC that assume nucleotide composition is constant tend to group taxa with similar nucleotide composition (e.g. GC-rich sequences) even if they are not closest relatives. Logdet is not so easily fooled in such circumstances, but can be tricked by other features of sequence evolution (such as among-site rate heterogeneity).

Perform a bootstrap analysis

We will cover bootstrapping and majority-rule consensus trees later in the course. But here is some brief background info:

Bootstrapping (introduced to phylogenetics by Joe Felsenstein in 1985[2]) is one of the most common methods for assessing which parts (i.e. edges) of an estimated tree are best supported and which are poorly supported by the data. In bootstrapping, many new datasets are created by sampling (with replacement) characters from the original dataset. A tree is estimated from each bootstrap replicate dataset, and at the end a consensus tree is produced that has in it all splits showing up in a majority of the bootstrap trees. The idea is that the original dataset can be thought of as one sample of characters from a vast pool of characters, and producing other data sets by resampling comes as close as possible to collecting data from other genes similar to the one for which you did collect data. The majority-rule consensus tree does not include splits that were present in fewer than half of the trees estimated from bootstrap replicate data sets, but the bipartition table that is produced by PAUP* provides the relative frequency of these less common splits.

Perform a bootstrap analysis (500 replicates, heuristic search using least squares) under the F84 model. The commands for doing this are shown below. I suggest you copy these into a new paup block in your runalgae.nex file. Be sure to disable your previous paup block by renaming it; if you change the name of a paup block slightly, for example, changing begin paup; to begin _paup, PAUP* will skip it when executing the file.
dset distance=f84 objective=lsfit power=2;
bootstrap nrep=500 search=heuristic bseed=5555;

Sometimes a split will just barely fail to make the cut (e.g. it appeared in 49% of the bootstrap trees), and so it is a good idea to check the bipartition table to make sure you don't fail to notice these. In this case, the F84 model behaves much like JC, putting the chlorophyll a/c containing Olithodiscus inside the chlorophyll a/b clade. You should have found that the bootstrap support for the split separating the chlorophyll a/b clade from the remaining two taxa is 31.4% (i.e. not very close to the 50% mark required for inclusion in the majority-rule consensus tree).

Compare NJ with a comparable heuristic search

In this exercise, you will conduct a Neighbor-joining (NJ) analysis using HKY distances and compare this with an heuristic search using the minimum evolution criterion. The goal of this section is to demonstrate that it is possible for heuristic searching to find a better tree than NJ, even using the same optimality criterion.

Please quit PAUP* and start it again. The reason for this will be made clear later, but mainly the purpose is to return all settings to their default values.

Put the commands below in a paup block in a new file. Note that we are again using the angio35.nex file:

execute angio35.nex;
dset distance=hky objective=me;
nj;
dscores 1;
hsearch start=1;
dscores 1;

Once you have figured out what is going on (ask us for help if you are stumped), fix your paup block and re-execute the file.

In your reanalysis, you should find that the heuristic search starting with the NJ tree found a better tree using the same optimality criterion (minimum evolution) being used by NJ. Neighbor-joining is very popular, but you should be wary of NJ trees involving large numbers of taxa. This analysis involved 35 taxa; for problem of this size or larger, it is almost certain that NJ will not find the best tree.

Part C: Balanced minimum evolution search in FastME

FastME is a program written by Desper and Gascuel. You can download it from here. Among other analyses (such as BIONJ), it implements fast NNI searching under the balanced minimum evolution criterion. NJ is a quick and dirty search under this criterion, and FastME can do branch swapping to find even better trees. The program produces trees very quickly, and this is best demonstrated on large datasets.

Click here to get a large (567 taxon, 2153 character) data file. Save it as 567Tx2153C.nex in a new folder on your local hard drive.

FastME does not analyze sequences directly. Instead we have to give it a distance matrix. We will use PAUP to create the input distance matrix.

Execute the 567Tx2153C.nex file in PAUP. Tell PAUP to use the HKY distance and then export the data file in a format that FastME can read using the command:

savedist format = phylip triangle = both diagonal file=dists.txt ;

While you have PAUP open, do a nj search, score the tree using minimum evolution, ordinary (unweighted) least squares, and weighted least squares.

Conduct a search under the weighted least squares criterion using the command: HSearch NoMulTrees; Score this tree under minimum evolution, ordinary (unweighted) least squares, and weighted least squares and note the amount of time the search took.

Now we will run FastME on the distance matrix:

FastME_MacOS
The program should prompt you for the name of the input file. At this point enter dists.txt and hit return. You should see a menu of options that let you choose what criterion to optimize and which algorithm to use. Conduct a Balanced Generalized Minimum Evolution search using the balanced NNI search. The program exits when it is finished (and it won't take long).

It should produce two output files: dists.txt_stat.txt and dists.txt_tree.txt. Open both in a text editor. To get the tree into PAUP we will have to change the tree file into a NEXUS file. Fortunately all you have to do to accomplish this is add:

#NEXUS
begin trees;
tree fastme = [&U]
to the file before the parenthetical notation, and
end;
to the end of the file.

Execute the 567Tx2153C.nex file and set the distance back to the HKY corrected distance. Use PAUP's GetTrees command to do this. Score this tree under minimum evolution, ordinary (unweighted) least squares, and weighted least squares and note the amount of time the search took.

How did the tree from FastME compare to the trees that you obtained by searching in PAUP?

References cited

  1. * Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution 11:605-612.
  2. * Felsenstein, J. 1985. Confidence intervals on phylogenies: an approach using the bootstrap. Evolution 39:783-791.