BIOL 848: Phylogenetic Methods

This is yet another lab borrowed from Paul Lewis. Thanks, Paul. The goal of this lab exercise is to show you how to use simulation to construct a null distribution for testing a phylogenetic hypothesis.

Background

In this lab we will use the same algae.nex file that we have used in several other labs. Recall that the cyanobacterium, Anacystis, and the chloroplast from Olithodiscus were the only two taxa to lack chlorophyll b. The other taxa are represented by chloroplast sequences that are thought to have descended from a single endosymbiotic organism (they all have both chlorophyll a and b). Thus we expect the other taxa to form a clade.

Because we have a limited amount of time in the computer lab, we will be using parsimony as our optimality criterion in this lab. The general principles of using Monte Carlo simulation to construct a null distribution apply to any optimality criterion, so you can adapt the lab to hypothesis testing using ML or distance-based approaches.

Get a data file

Download (again) the data file algae.nex.

Find the score of the most parsimonious tree.

Launch PAUP and perform an appropriate parsimony search to get the most parsimonious tree/trees in memory. (Note the size of the data set and choose the appropriate PAUP commands.)

How many most parsimonious trees are there? answer

How many steps are required to explain the data set on this tree? answer

Does the tree display the chlorophyll a/b clade (you should specify Anacystis as the outgroup to see the trees rooted correctly)? answer

Find the most parsimonious tree that is consistent with our null hypothesis

Now we want to find the highest scoring tree that has a clade of Euglena, Chlorella, Chlamydomonas, Marchantia, Rice, and Nicotiana. We can do this in a number of ways. For a small dataset (like this one), we could get all of the trees in memory, use PAUP's Filter command to remove from memory the trees that do not contain the group, and finally use the SortTrees command to find the best score among all of the trees that are left.

Specifying a constraint

  1. We will actually use the more generic approach (the one that you would use even on a large dataset). We will tell PAUP to do a search but also honor a topological constraint. The first step is to tell PAUP to use the constraint. The command to do this is
    Constraints some_name (MONOPHYLY) = newick_description_here
    In the place of "newick_description_here" you should put the parenthetical notation for a tree that has just the clade that you would like to constrain.
    See if you can define a constraint tree called ab that puts Tobacco, Rice, Marchantia, Chlamydomonas, Chlorella, and Euglena in a clade (excluding Anacystis_nidulans and Olithodiscus). Use the taxon names (rather than taxon numbers) to define the constraint. Note that you will have to use Anacystis_nidulans rather than Anacystis nidulans so that PAUP does not get confused by the space in the taxon name.
  2. Use the ShowConstr command to see the constraint tree that you have defined, and make sure that it looks right.

    What was the command? answer

You should be aware that you can also load a constraint from a tree file using the LoadConstr command. When you are dealing with large numbers of taxa and complex constraints it is often helpful to construct the constraint tree in Mesquite, save it to a file, and then read it into PAUP using the LoadConstr command.

A constrained search

  1. Now we will conduct a branch-and-bound search that honors the constraint tree that we have just defined:
    BAndB enforce constraints=ab ;
    The enforce keyword tells PAUP to consider only trees that satisfy a constraint, and the constraints=ab keyword tells PAUP which constraint tree to use. Note that you can also use constraints with PAUP's HSearch command.
    What is the parsimony score of the best tree compatible with the constraint? answer
  2. Use the ShowTrees command to see the tree. Does it satisfy the constraint? answer

Hypothesis testing

As our test statistic, we will use the difference in parsimony score between the best (unconstrained) tree and the best tree that satisfies our null hypothesis. What is the value of the test statistic for our data? answer

To interpret the value of the test statistic we need some idea of what range of values would be produced if the null hypothesis were true. This is tricky to do. For one thing, there are lots of trees that are compatible with the null hypothesis. It seems logical (but is not necessarily correct) to take the tree from the constrained search as the tree to represent the null hypothesis. After all, among all of the trees compatible with the null hypothesis, it is the one that best explains the data (according to parsimony).

Even if we are satisfied with the choice of a tree to represent the best the null hypothesis can do, we still need a way to find out what the distribution of the test statistic would be if this null were true. We will use Monte Carlo simulation for this. In particular, we will use seq-gen (available here) to generate a large number of datasets of the kind we would expect to see if the null were true. Then we will calculate the test statistic on each of them. This will give us a null distribution of the test statistic, to which we can compare the value from our real data.
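
As a preview of where this is heading, here is a minimal Python sketch of that final comparison, assuming the simulated score differences have already been collected into a list (the steps for actually producing them with seq-gen and PAUP are described below). The numbers and names are purely illustrative.

    # Illustrative sketch: comparing the observed test statistic to a Monte
    # Carlo null distribution of parsimony score differences. The differences
    # are assumed to be expressed as non-negative numbers of extra steps.

    def monte_carlo_p_value(observed, null_values):
        """Fraction of simulated values at least as extreme as the observed value."""
        extreme = sum(1 for v in null_values if v >= observed)
        return extreme / float(len(null_values))

    if __name__ == "__main__":
        null_diffs = [0, 0, 1, 1, 1, 2, 2, 3, 4, 6]  # would be 1000 values in this lab
        observed_diff = 5                            # the test statistic from your real data
        print("estimated p-value:", monte_carlo_p_value(observed_diff, null_diffs))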

Finding parameters

  1. To simulate data, seq-gen needs a fully-specified model and a tree with branch lengths. We can use the tree that we found in the constrained search and the GTR+I+G model to get the necessary input. Assuming that you have not quit PAUP and the tree found in the constrained search is still in memory:
    set crit=like;
    lset nst=6 rmat=est basefreq=est rates=gamma shape=est pinv=est;
    lscore;
    savetrees file=model.tre brlens format=altnexus;
    
    Make sure that you understand these commands.
    From the standard output of PAUP you should have the model parameter values for the simulation. If you look at the tree file you should see a tree with branch lengths.
    Does the tree have taxon names in the newick representation instead of taxon numbers? answer
  2. Unfortunately, seq-gen does not understand NEXUS tree files. Save just the newick tree description from the file to a new file called model.txt.
    If you are a UNIX geek you can do this by quitting PAUP and issuing the following command (a Python alternative is sketched just below):
    cat model.tre | grep PAUP_1 | awk '{print $5}' > model.txt
    
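A few lines of Python can do the same extraction if you would rather not rely on the shell. The sketch below is only illustrative: it assumes the altnexus tree file contains a single tree line of the form "tree PAUP_1 = [&U] (...);" (the same assumption the grep/awk pipeline above makes) and uses the same file names.

    # Illustrative alternative to the grep/awk pipeline: copy the newick string
    # from the NEXUS tree file written by SaveTrees into model.txt.
    with open("model.tre") as infile, open("model.txt", "w") as outfile:
        for line in infile:
            stripped = line.strip()
            if stripped.lower().startswith("tree "):       # the tree definition line
                newick = stripped.split("=", 1)[1].strip()  # drop "tree PAUP_1 ="
                newick = newick.replace("[&U]", "").replace("[&R]", "").strip()  # drop rooting comment
                outfile.write(newick + "\n")
                break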

Invoking seq-gen

  1. Download the Mac disk image of seq-gen (from here) and drag the seq-gen executable to the directory that you are using for this lab.
  2. To see the options for seq-gen use the command
    seq-gen -h
  3. To make sure everything is working do a simple test run using the HKY model:
    seq-gen -mHKY model.txt
    This should generate a dataset and print it to the screen.
  4. The simulation used its default parameter values for the HKY model. We'd like to use the parameters that we inferred from our real data (because the parameter values will affect the dataset-to-dataset variance, and hence the distribution of our test statistic). All options are given to seq-gen as command-line flags. The ones that we will use are -m (the substitution model), -a (the gamma shape parameter), -i (the proportion of invariable sites), -r (the six GTR rate-matrix values), -f (the base frequencies), -l (the sequence length), -n (the number of simulated datasets), -on (NEXUS-format output), and -x (a file whose contents are inserted after each simulated dataset). Finally, we'll want to redirect the output to a file using the > redirection operator. We have to do one more thing before running seq-gen.
  5. Save the following lines:
    begin paup;
    	execute run.nex;
    end;
    
    to a file called defpaup.nex
  6. My full seq-gen invocation was:
    seq-gen -mGTR -a0.615276 -i0.357336 -r 0.64253 2.17067 0.75026 0.21462 4.68720 1.00000 -f 0.275521 0.209086 0.302208 0.213185 -l920 -n1000 -on -xdefpaup.nex model.txt > simdata.nex
    Use the values that you got from PAUP's LScore to construct a similar command and run it.
  7. Open simdata.nex in a text editor. Do you see the content of the defpaup.nex file?

Running PAUP on the simulated data

Now we have 1000 datasets. How are we going to analyze them all? Fortunately we have the PAUP command execute run.nex; intercalated between the datasets. If we put the commands that we want PAUP to execute in the run.nex file, then those commands will be executed for each dataset.

What do we want to do for each dataset? Well, we want to see the difference in score between the true tree and the preferred (most-parsimonious) tree. This will give us a distribution of the amount of spurious support we could see for a tree even when it is not correct.

  1. Save the following commands to the run.nex file:
    BAndB;
    [!****This is the best tree's score****]
    pscore;
    gettrees file = model.tre;
    [!####This is the true tree's score####]
    pscore;
    
    These commands find the "best" tree, score it, get the null model tree (the true tree for the simulations), and score it. We are using the output comments to make the output more visible.
  2. Finally, to run all of the analyses it is helpful to have a simple "master" PAUP file:
    log start replace file=sim.log;
    execute simdata.nex;
    log stop;
    
    Save this file as master.nex
  3. Invoke PAUP using the -n flag so that it runs in non-interactive mode (and does not pester you with 1000 questions):
    paup -n master.nex
    
  4. After a few seconds, you should have completed the analysis of all 1000 datasets.

Summarizing the output

Behold the power of UNIX!
We are going to connect the standard output of one process to the standard input of another. This is done with a "pipe" between the processes. From our shell, this is done with the | symbol.
  1. The command:
    cat sim.log
    spits out all 83,000 lines of the log file to the screen. Ugh!
  2. The command:
    cat sim.log | grep "Length "
    filters all of those lines so that only those containing the word Length followed by a space are printed. This selects just the output from the PScore commands that we ran.
  3. Want to get just the scores of the first tree each time (without the word Length)? Try:
    cat sim.log | grep "Length " | awk '{print $2}'
  4. We are close. We now need to subtract the number in the second line from the first line, the number in the fourth line from the third line, and so on. This would give us the difference in score for each replicate. I wrote a simple Python script that you can download from here to do that. Save the script as consecutiveDiffs.py in the same directory that you have been working in. (A sketch of what such a script might look like appears at the end of this section.)
  5. Now
    cat sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py
    Should display the length differences.
  6. At this point (or really a couple of steps ago) you could take the data into Excel and find the critical value for the p=0.05 level by looking for the 50th most extreme difference. We can finish the UNIX way by
    cat sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g
    to sort the values numerically.
  7. And finally:
    cat sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g | head -n50
    to show the 50 most extreme length differences.
  8. Was the observed difference in this tail?
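
For reference, here is a minimal sketch of what a consecutive-differences script might look like. It is only an illustration of the idea, not the consecutiveDiffs.py script you downloaded: it assumes one score per line on standard input and follows the "subtract the second line from the first" convention described above.

    #!/usr/bin/env python
    # Illustrative consecutive-differences filter: read one number per line from
    # stdin and print the difference for each pair of lines (1st - 2nd, 3rd - 4th, ...).
    import sys

    values = [float(line) for line in sys.stdin if line.strip()]
    for best_score, true_score in zip(values[0::2], values[1::2]):
        print(best_score - true_score)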