BIOL 848: Phylogenetic Methods

This lab was written by Paul O. Lewis and being used in BIOL 848 (with slight modifications) by Mark Holder with Paul's permission (the original lab is here).

Lab 3: Tree searching using PAUP (and parsimony).

This lab consists of two parts:

Part A: Using PAUP* to check your answers for homework #4

  1. Create a NEXUS data file containing the data for the one DNA site with A for taxa 1 and 2, and G for taxon 3, and C for taxon 4. To do this, open TextEdit (or any other text editor) File > New from the main menu to create an empty file. Referring to the description of a nexus data block from the previous lab, build a NEXUS file containing 4 taxa and 1 character. Be sure to save the file, giving it a name like homework4.nex.

  2. Execute your new data file to make sure there were no mistakes:
    1. change your working directory to the directory with your file
    2. launch PAUP*
    3. execute homework4.nex ;
    Even though you are not yet finished setting up the file, it is always a good idea to periodically check to make sure PAUP* can interpret the contents. If you wait too long before doing this check, then you end up making multiple mistakes and it gets increasingly hard to figure out how to fix all of them! If your file executes successfully, you should see a line such as the following:
    Processing of file "homework4.nex" completed.


  3. Go back to your text editor, and add a stepmatrix definition to specify that transitions should cost 1 step while transvesions should cost 2 steps. Add an assumptions block to your data file (note: do not add another "#NEXUS" to your file, this should appear only as the first thing in a NEXUS file). Your assumptions block should look like this:
    begin assumptions;
     usertype my_ctype stepmatrix=4
         A C G T
     [A] . 2 1 2
     [C] 2 . 2 1
     [G] 1 2 . 2
     [T] 2 1 2 .
     ;
    end;
    Feel free to replace "my_ctype" with a name of your choosing (but avoid using spaces or punctuation; note that I used an underscore character instead of a space). Save your file (using <Ctrl-s> is easiest), then re-execute using <Ctrl-r> to ensure that you haven't made any mistakes.

  4. Execute the file in PAUP again. Now we will apply the newly-defined stepmatrix to character 1. First use the PAUP* cstatus command to examine the current status of your single character:
    cstatus full
    The full keyword produces a list showing information about each character individually. What is the current type of character? Which variant of parsimony would you be using were you to perform an analysis at this point (Wagner, Fitch, Transversion, or Ordered)? Use the PAUP* ctype command to apply your new usertype to character 1 as follows:
    ctype my_ctype: 1;
    PAUP* will not respond with any confirmation, but check to make sure that your stepmatrix has been applied to character 1 using the cstatus command again. Note that this time PAUP* says
    All characters are of user-defined type "my ctype"
    You may be wondering where the underscore character went: when producing output, PAUP* converts underscore characters to spaces in taxon, tree, character, and usertype names.

  5. Create a PAUP block in your data file so that the usertype can be applied automatically each time you execute the file. The next time you open and execute this file, the usertype will be defined but it will not be applied to any characters. Adding a paup block to the file allows you to perform some operations (such as applying a usertype to one or more characters) each time you execute the file. Create a paup block after your assumptions and data blocks in the file homework4.nex. Add two commands to your paup block: 1) the ctype command above that applies your usertype to character 1; and 2) the cstatus full command. Be sure to end each command with a semicolon! After saving your work, quit PAUP* and relaunch it, executing homework4.nex after PAUP* starts up. Notice that it is now clear that your usertype has been applied already.

  6. Search through all possible trees for the one having the best parsimony score. Use the alltrees command for this. With so few possible trees to consider, PAUP* will finish this almost immediately
    PAUP* will present a histogram in which each bar represents trees of a given length (i.e. number of steps). The tree length is the number at the base of the bar (to the left). The number in parentheses at the top of a bar (to the right) is the number of trees having that length. Do these results agree with your hand calculations?

  7. Use PAUP* to verify ancestral character state assignments. You found that there was some ambiguity about ancestral character states for tree 1 (the tree that places taxon 1 with taxon 2. In particular, you found that both C and T were equally viable possibilities for one of the interior nodes, whereas only A was viable at the other interior node. To get PAUP* to verify this, use the following command:
    describetrees 1 / mprsets
    This tells PAUP* to describe tree 1, and the option mprsets tells PAUP* to show what character state assignments are possible at each interior node (the Most Parsimonious Reconstruction sets). Is the result what you expected based on the hand calculations you performed for the homework assignment?

Part B: Searching under the parsimony criterion

  1. Create the data file. Download the file from here and save it on the machine that you are working on.
  2. Create a command file. Create a blank file, then type in the following text, saving this as run.nex:
    #nexus
    
    begin paup;
      log file=output.txt start replace;
      execute angio35b.nex;
    end;
    in the same directory that holds the angio35b.nex file that you downloaded
  3. Execute run.nex, which will in turn execute angio35b.nex. There at least a couple of advantages to creating little NEXUS files like run.nex. For now, the only advantage is that executing run.nex automatically starts a log file so that you will have a record of what you did. Later, when you get in the habit of putting commands in paup blocks, you will appreciate the separation of the data from the commands that initiate analyses (I have many times opened a data file, forgetting about the embedded paup block that then starts a long search, overwrites my previous log file, and otherwise creates havoc).

    Note that because we used the replace keyword in the log command, the file output.txt will be overwritten without warning if it exists. This is called living dangerously, so you may want to refrain from using the replace keyword so that PAUP* asks before overwriting files.
  4. Delete all taxa except the first five. Using this command
    delete 6-.
    will cause PAUP* to ignore all taxa except Ephedrasinica, Gnetum_gnemJS, WelwitschiaJS, Ginkgo_biloba, and Pinus_ellCH.
  5. Perform an exhaustive search using parsimony. Use the alltrees command for this. This should go fast because you now have only 5 taxa.
    • How many separate tree topologies did PAUP* examine?
    • What is the parsimony treelength of the best tree? The worst tree?
    • How many steps separate the best tree from the next best?
  6. Perform an heuristic search using NNI branch swapping. Before you start, use the describe command to show you the tree obtained from the exhaustive enumeration.
    • Draw this tree on a piece of paper and then draw the 4 possible NNI rearrangements
  7. Find all NNI rearrangements of the best tree. Note that because we performed an exhaustive enumeration, we now know which tree is the globally most parsimonious tree. We are thus guaranteed to never find a better tree were we to start an heuristic search with this tree. Let's do an experiment: perform an NNI heuristic search, starting with the best tree, and have PAUP* save all the trees it encounters in this search. In the end, PAUP* will have in memory 5 trees: the starting tree and the 4 trees corresponding to all possible NNI rearrangements of that starting tree: hsearch start=1 swap=nni nbest=15;
    • start=1 starts the search from the tree currently in memory (i.e., the best tree resulting from your exhaustive search using the parsimony criterion)
    • swap=nni causes the Nearest-Neighbor Interchange (NNI) method to be used for branch swapping
    • nbest=15 saves the 15 best trees found during the search. Thus, were PAUP* to examine every possible tree, we would end up saving all of them in memory. The reason this command is needed is that PAUP* ordinarily does not save trees that are worse than the best one it has seen thus far. Here, we are interested in seeing the trees that are examined during the course of the search, even if they are not as good as the starting tree.
  8. Show all 5 trees in memory. Use the describe all command to plot the 5 trees currently in memory. The reason we are using the describe command rather than the showtrees command is because we want PAUP* to show us the numbers it has assigned to the internal nodes, something that showtrees doesn't do.
    • Which tree was the original tree?
    • Which trees correspond to NNI rearrangments of which internal edges on the original tree?
  9. Find the most parsimonious tree for all 35 taxa. Restore all taxa using the restore all command (this will wipe out the 5 trees you currently have stored in memory, but that is ok), then conduct a heuristic search having the following characteristics:
    • The starting trees are each generated by the stepwise addition method, using random addition of sequences
    • Swap using NNI branch swapping
    • Reset the nbest option to all because we want to be saving just the best trees, not suboptimal trees (yes, this option is a little confusing).
    • Set the random number seed to 5555 (this determines the sequence of pseudorandom numbers used for the random additions; ordinarily you would not need to set the random number seed, but we will do this here to ensure that we all get the same results)
    • Do 500 replicate searches; each replicate represents an independent search starting from a different random-addition tree
      Here is the full command implementing this search: hsearch start=stepwise addseq=random swap=nni nbest=all rseed=5555 nreps=500;
    • How many tree islands were found?
    • How long did the search take?
    • How many rearrangements were tried?
  10. Conduct a second search with SPR swapping. Be sure to reset the random number seed to 5555. You should be able to figure out how to do this using the output from hsearch ?. Note that to save typing you can call up previously entered commands using the little buttons on the right of the command line edit control.
    • How many tree islands were found?
    • What are the scores of the trees in each island?
    • How long did the search take this time?
    • How many rearrangements were tried?
  11. Now conduct a third search with TBR swapping.
    • How many tree islands were found?
    • What are the scores of the trees in each island?
    • How long did the search take this time?
    • How many rearrangements were tried?
    • How many trees are currently in memory (use the treeinfo command)?
    • Has PAUP* saved trees from all islands discovered during this search? (Hint: compare "Number of trees retained" to the sum of the "Size" column in the Tree-island profile.) Explain why PAUP* saved the number of trees it did.

      Wondering about that warning "Multiple hits on islands of unsaved trees may in fact represent different islands"? When PAUP* encounters a new island, it will find all trees composing that particular island in the process of branch swapping. Thus, if (in a new search) it encounters any trees already stored in memory, it knows that it has hit an island that it found previously. Note that it would be pointless to continue on this tack, because it will only find all the trees on that island again. For trees retained in memory, PAUP* can keep track of which island they belong to (remember that it is possible for trees with the same parsimony score to be in different tree islands!). But for trees that are not retained in memory, PAUP* only knows that it has encountered an island of trees having score X; it has no way of finding out how many islands are actually represented amongst the trees having score X.
    • Of the three types of branch swapping (NNI, SPR, TBR), which is the most thorough? Least thorough?
    • How many TBR rearrangements can PAUP* examine on the computer you are using in one second (on average)?
    • Based on this, how long would it take to examine all possible trees if only 20 sequences were included?
Feel free to look at the NEXUS datasets that cause multiple NNI islands (here) and the cause star decomposition to fail even in the absence of homoplasy (here).