BIOL 848: Phylogenetic Methods
In this lab we will use the same algae.nex file that we have used in several other labs. Recall that the cyanobacterium, Anacystis, and the chloroplast from Olithodiscus were the only two taxa to lack chlorophyll b. The other taxa are represented by chloroplast sequences that are thought to have descended from a single endosymbiotic organism (they all have both chlorophyll a and b). Thus we expect the other taxa to form a clade.
Because we have a limited amount of time in the computer lab, we will be using parsimony as our optimality criterion in this lab. The general principles of using Monte Carlo simulation to construct a null distribution apply to any optimality criterion, so you can adapt the lab to hypothesis testing using ML or distance-based approaches.
Download (again) the data file algae.nex.
Launch PAUP and perform an appropriate parsimony search to get the most parsimonious tree/trees in memory. (Note the size of the data set and choose the appropriate PAUP commands.)
How many most parsimonious trees are there? answer
How many steps are required to explain the data set on this tree? answer
Does the tree display the chlorophyll a/b clade (you should specify Anacystis as the outgroup to see the trees rooted correctly)? answer
Now we want to find the highest scoring tree that has a clade of Euglena, Chlorella, Chlamydomonas, Marchantia, Rice, and Nicotiana. We can do this in a number of ways. For a small dataset (like this one), we could get all of the trees in memory, use PAUP's Filter command to remove from memory the trees that do not have the group, and finally use the SortTrees command to find the best score among the trees that are left.
Alternatively, we can perform a constrained search. First define the constraint with a command of the form
Constraints some_name (MONOPHYLY) = newick_description_here
In the place of "newick_description_here" you should put the parenthetical notation for a tree that has just the clade that you would like to constrain. Note that you should write the taxon names in the form
Anacystis_nidulans
rather than
Anacystis nidulans
so that PAUP does not get confused by the space in the taxon name.
What was the command? answer
You should be aware that you can also load a constraint from a tree file using the LoadConstr command. When you are dealing with large numbers of taxa and complex constraints it is often helpful to construct the constraint tree in Mesquite, save it to a file, and then read it into PAUP using the LoadConstr command.
BAndB enforce constraints=ab ;
The enforce keyword tells PAUP to only consider trees that satisfy a constraint, and the constraints=ab option tells PAUP which named constraint to enforce (here, one called ab). Note that you can also use constraints with the HSearch command of PAUP.
As our test statistic, we will use the difference in parsimony score between the best (unconstrained) tree and the best tree that satisfies our null hypothesis. What is the value of the test statistic for our data? answer
To interpret the value of the test statistic we need to have some idea of what range of values would be produced if the null hypothesis were true. This is tricky to do. For one thing, there are lots of trees that are compatible with the null hypothesis. It seems logical (but is not necessarily correct) to take the tree from the constrained search as the tree to represent the null hypothesis. After all, among all of the trees compatible with the null hypothesis, it is the one that best explains the data (according to parsimony).
Even if we are satisfied about the choice of a tree that represents the best the null can do, we still have to have a way to find out what the distribution of the test statistic would be if this null were true. We will use Monte Carlo simulations for this. In particular we will use seq-gen (available here) to generate a bunch of datasets that are compatible with the kind of data that we would see if the null were true. Then we will calculate the test statistic on each of them. This will give us a null distribution of the test statistic. We can compare our real data to that.
In PAUP, estimate the model parameters and save the tree with its branch lengths:
set crit = like; lset nst=6 rmat=est basefreq=est rates = gamma shape = est pinv=est; lscore; savetrees file = model.tre brlens format=altnexus;
Make sure that you understand these commands.
cat model.tre | grep PAUP_1 | awk '{print $5}' > model.txt
This pulls out the line containing the tree (named PAUP_1), keeps only its fifth whitespace-separated field (the Newick description with branch lengths), and writes it to model.txt so that seq-gen can use it.
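If grep and awk are unfamiliar, the same extraction can be sketched in Python. This sketch assumes, as the pipeline above does, that the saved tree is named PAUP_1 and that the Newick description is the fifth whitespace-separated field on its line:

```python
def extract_newick(lines):
    """Return the fifth whitespace-separated field (awk's $5) of each
    line mentioning PAUP_1: the Newick description with branch lengths."""
    return [line.split()[4] for line in lines
            if "PAUP_1" in line and len(line.split()) >= 5]

# Hypothetical tree line in the altnexus format that SaveTrees produces:
sample = ["tree PAUP_1 = [&U] (A:0.1,(B:0.2,C:0.3):0.05);"]
print(extract_newick(sample)[0])  # prints (A:0.1,(B:0.2,C:0.3):0.05);
```

To mimic the whole pipeline you would read the lines of model.tre, pass them through this function, and write the result to model.txt.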
seq-gen -h
displays a summary of seq-gen's options; look them over so that the flags used below make sense.
seq-gen -mHKY model.txt
This should generate a dataset and print it to the screen.
Save the text
begin paup; execute run.nex; end;
to a file called defpaup.nex.
seq-gen -mGTR -a0.615276 -i0.357336 -r 0.64253 2.17067 0.75026 0.21462 4.68720 1.00000 -f 0.275521 0.209086 0.302208 0.213185 -l920 -n1000 -on -xdefpaup.nex model.txt > simdata.nex
Use the values that you got from PAUP's LScore to construct a similar command and run it.
Now we have 1000 datasets. How are we going to analyze them all? Fortunately we have the PAUP command execute run.nex; intercalated between the data sets. If we put the commands that we want PAUP to execute in the run.nex file, then those commands will be executed for each dataset.
What do we want to do for each dataset? Well, we want to see the difference in score between the true tree and the preferred (most-parsimonious) tree. This will give us a distribution of the amount of spurious support we could get for a tree even when it is not correct.
Put the following commands in run.nex:
BAndB; [!****This is the best tree's score****] pscore; gettrees file = model.tre; [!####This is the true tree's score####] pscore;
These commands find the "best" tree, score it, get the null model tree (the true tree for the simulations), and score it. We are using the output comments to make the output more visible.
Next, create a file containing
log start replace file=sim.log; execute simdata.nex; log stop;
and save this file as master.nex.
paup -n master.nex
runs PAUP non-interactively on master.nex, analyzing all 1000 simulated datasets.
cat sim.log
spits out all 83,000 lines of the log file to the screen. Ugh!
cat sim.log | grep "Length "
filters those lines so that only the ones containing the word Length followed by a space are printed. This selects just the output of the PScore commands that we ran.
cat sim.log | grep "Length " | awk '{print $2}'
prints only the second whitespace-separated field of each of those lines: the tree length itself.
cat sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py
should display the length differences.
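You were given consecutiveDiffs.py, so you do not need to write it, but its core logic is worth understanding. A minimal sketch of what such a script might do, assuming the input alternates the best tree's length and the true tree's length for each simulated dataset:

```python
def consecutive_diffs(values):
    """Pair up consecutive values and return (first - second) for each pair.

    If each pair is (best tree length, true tree length), the differences
    are zero or negative, so after sort -g the most extreme ones come first.
    """
    return [a - b for a, b in zip(values[0::2], values[1::2])]

# Hypothetical lengths for two simulated datasets:
# best=100, true=103 for the first; best=98, true=98 for the second.
print(consecutive_diffs([100, 103, 98, 98]))  # prints [-3, 0]
```

The real script reads its numbers from standard input, one per line, which is why it sits in the middle of the pipe; the direction of subtraction here is an assumption.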
cat sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g
sorts the values numerically.
cat sim.log | grep "Length " | awk '{print $2}' | python consecutiveDiffs.py | sort -g | head -n50
shows the 50 most extreme length differences.
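To complete the test, ask where the real data's test statistic falls in this null distribution. A minimal sketch of that final comparison, using purely hypothetical numbers (your values will come from the pipeline above and from the constrained search):

```python
# Hypothetical null distribution: magnitudes of the score difference between
# the best tree and the true tree for each simulated dataset (in practice,
# the values that consecutiveDiffs.py prints).
null_diffs = [0, 1, 0, 2, 5, 1, 0, 3, 1, 2]

# Hypothetical observed test statistic: best constrained score minus best
# unconstrained score for the real data.
observed = 4

# One-tailed Monte Carlo p-value: the fraction of simulated datasets that
# show at least as much spurious support as we actually observed. The +1
# terms count the real dataset itself among the replicates.
extreme = sum(1 for d in null_diffs if abs(d) >= observed)
p_value = (extreme + 1) / (len(null_diffs) + 1)
print(p_value)  # prints 0.18181818181818182
```

If few or none of the 1000 simulated datasets produce a difference as large as yours, you can reject the null hypothesis.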