BIOL 848: Phylogenetic Methods

This lab was written by Paul O. Lewis and being used in BIOL 848 (with slight modifications) by Mark Holder with Paul's permission.

The goal of this lab exercise is to show you how to conduct distance-based analyses in PAUP* and FastME.

Contents

Part A: Basic distance analyses in PAUP

Download the data file

Download the angio35b.nex that we used in the last lab week. Save the file to your local computer.

Create a PAUP* command file

Use a text editor to create a new file. Save it as run.nex in the same directory as the angiob.nex file. This new file will be a NEXUS file that contains the PAUP block with the commands for PAUP. Here are the commands:

#NEXUS

begin paup;
  execute angio35b.nex;
  dset dist=abs;
  delete 5-.;
  exclude missambig;  
  showdist;
end;

Type (or copy these commands in the file and save it as plain text -- not RTF).

The file run.nex tells PAUP* to:

Save the run.nex file. Then execute it PAUP*, and examine the output.

Calculate p distances and JC distances

To see the distances as a proportion of sites that differ for the sequences just change the distance measure from abs to p and re-execute run.nex.

Examine the resulting data matrix, change the distance correction in run.nex from p to jc to tell PAUP to use the Jukes-Cantor distance. Re-execute the file.

Execute the dset ? command to see all of the distance settings that are available in PAUP. If you are confused by an option, you can check it out by downloading the PAUP manual from here (or by asking me about an option).

Estimate edge lengths using weighted least squares

Next, perform a search using weighted least squares (weighted usually implies that the power is 2 in the denominator of the sum-of-squares formula). Add the following line to your run.nex file, just below the execute angiob35.nex; line:

set criterion=distance ;

This command tells PAUP* to use the distance optimality criterion specified by the objective option of the dset command during the search. If you were to leave this out, PAUP* would use the default optimality criterion (parsimony). Now issue the command dset ? in PAUP* to get a listing of the current values of all distance settings.

If your answer is not 2, then add this line to your paup block, below the execute command:

dset power=2;

If your answer is not lsfit, then add this line to your paup block, also below the execute command:

dset objective=lsfit;

Finally, add the following two lines to the end of your paup block to perform the search and show the resulting best tree, including the estimated branch lengths:

alltrees;
describe 1 / brlens;

Save and re-execute the file.

Searching under the minimum evolution criterion

Before moving on to the next exercise, repeat the above search using the minimum evolution (or ME) criterion. To do this, you will need to add the objective=ME option to your dset command (be sure to remove the previous dset objective=lsfit if you had to put it in) and re-execute run.nex.

Part B: Analysis of algae.nex

Create the data file algae.nex

Click here to get the data file. Save it as algae.nex in a new folder on your local hard drive. These data were originally used in a 1994 study by Lockhart et al.[1] and comprise eight 16S ribosomal RNA sequences:

Anacystis a cyanobacterium (has chlorophyll a but not b or c)
Olithodiscus a chloroplast from a chromophyte alga (chlorophylls a and c)
Euglena a chloroplast from a photosynthetic euglenophyte protist (chlorophylls a and b)
Chlorella a chloroplast from a a chlorophyte green alga (chlorophylls a and b)
Chlamydomonas a chloroplast from a chlorophyte green alga (chlorophylls a and b)
Marchantia a chloroplast from a thallose liverwort (chlorophylls a and b)
Oryza a chloroplast from a monocot (chlorophylls a and b)
Nicotiana a chloroplast from a dicot (chlorophylls a and b)

All of these organisms except Anacystis and Olithodiscus have chlorophylls a and b. Based on independent evidence, it is probable that all chlorophyll a/b-containing chloroplasts have a common endosymbiotic origin, so we would expect trees constructed from these data to show a branch separating Anacystis and Olithodiscus from everything else. The cyanobacterium Anacystis uses phycobilin accessory pigments rather than chlorophylls for photosynthesis, and the chromophyte alga Olithodiscus has chlorophylls a and c (but not b).

Create a PAUP* command file.

As you did above create a new run.nex file to serve as a PAUP command file. Copy the following text and paste it into the new text file window, then save this file as runalgae.nex, placing it in the same directory as algae.nex:
#nexus

begin paup;
  log file=lab3.txt start replace;
  set torder=right;
  execute algae.nex;
  outgroup Anacystis_nidulans;
  set criterion=distance;
  
  [!
  ****************************
  ** JC Model, ME Criterion **
  ****************************
  ]
  dset distance=jc objective=me;
  alltrees;
  describe 1;
  
  [other commands here]
  log stop;
end;

During the course of this lab, you can simply add commands to this paup block rather than creating a paup block in the data file itself. The set torder=right command simply causes trees to be displayed so that the outgroup is at the top and the tree appears to flow to the right (this is often called ladderizing right). Try changing this to set torder=left to see what ladderizing left looks like.

Perform an exhaustive search under the minimum evolution and least squares criteria

Use both the JC and logdet models combined with both Minimum Evolution and Least Squares (using the default power of 2) optimality criteria. Use dset dist=? to find out how to specify the model, and use dset objective=? to find out how to specify which of the two distance-based optimality criteria to use. Set up all 4 analyses (2 models times 2 optimality criteria) in the paup block (the first one has already been done for you), then run them all at once by executing the file. Note the printable comment (it starts with an exclamation point). Making comments like this in your paup block will allow you to easily find where the results from each model start in the output.

The main point here is that the model used can make a difference in whether you get the correct answer. The logdet model has not yet been discussed: a bit more background on models is needed before an explanation of the logdet model will make sense. For now, just know that the logdet model is one of the few models in common use that is not tricked by nucleotide composition that changes across the tree. Models like JC that assume nucleotide composition is constant tend to group taxa with similar nucleotide composition (e.g. GC-rich sequences) even if they are not closest relatives. Logdet is not so easily fooled in such circumstances, but can be tricked by other features of sequence evolution (such as among-site rate heterogeneity).

Perform a bootstrap analysis

We will cover bootstrapping and majority-rule consensus trees later in the course. But here is some brief background info:

Bootstrapping (introduced to phylogenetics by Joe Felsenstein in 1985[2]) is one of the most common methods for assessing which parts (i.e. edges) of an estimated tree are best supported and which are poorly supported by the data. In bootstrapping, many new datasets are created by sampling (with replacement) characters from the original dataset. A tree is estimated from each bootstrap replicate dataset, and at the end a consensus tree is produced that has in it all splits showing up in a majority of the bootstrap trees. The idea is that the original dataset can be thought of as one sample of characters from a vast pool of characters, and producing other data sets by resampling comes as close as possible to collecting data from other genes similar to the one for which you did collect data. The majority-rule consensus tree does not include splits that were present in fewer than half of the trees estimated from bootstrap replicate data sets, but the bipartition table that is produced by PAUP* provides the relative frequency of these less common splits.

Perform a bootstrap analysis (500 replicates, heuristic search using least squares) under the F84 model. The commands for doing this are shown below. I suggest you copy these into a new paup block in your runalgae.nex file. Be sure to disable your previous paup block by renaming it; if you change the name of a paup block slightly, for example, changing begin paup; to begin _paup, PAUP* will skip it when executing the file.
dset distance=f84 objective=lsfit power=2;
bootstrap nrep=500 search=heuristic bseed=5555;

Sometimes a split will just barely fail to make the cut (e.g. it appeared in 49% of the bootstrap trees), and so it is a good idea to check the bipartition table to make sure you don't fail to notice these. In this case, the F84 model behaves much like JC, putting the chlorophyll a/c containing Olithodiscus inside the chlorophyll a/b clade. You should have found that the bootstrap support for the split separating the chlorophyll a/b clade from the remaining two taxa is 31.4% (i.e. not very close to the 50% mark required for inclusion in the majority-rule consensus tree).

Compare NJ with a comparable heuristic search

In this exercise, you will conduct a Neighbor-joining (NJ) analysis using HKY distances and compare this with an heuristic search using the minimum evolution criterion. The goal of this section is to demonstrate that it is possible for heuristic searching to find a better tree than NJ, even using the same optimality criterion.

Please quit PAUP* and start it again. The reason for this will be made clear later, but mainly the purpose is to return all settings to their default values.

Put the commands below in a paup block in a new file. Note that we are again using the angio35.nex file:

execute angio35.nex;
dset distance=hky objective=me;
nj;
dscores 1;
hsearch start=1;
dscores 1;

Once you have figured out what is going on (ask us for help if you are stumped), fix your paup block and re-execute the file.

In your reanalysis, you should find that the heuristic search starting with the NJ tree found a better tree using the same optimality criterion (minimum evolution) being used by NJ. Neighbor-joining is very popular, but you should be wary of NJ trees involving large numbers of taxa. This analysis involved 35 taxa; for problem of this size or larger, it is almost certain that NJ will not find the best tree.

Part C: Balanced minimum evolution search in FastME

FastME is a program written by Desper and Gascuel. You can download it from here. Among other analyses (such as BIONJ), it implements fast NNI searching under the balanced minimum evolution criterion. NJ is a quick and dirty search under this criterion, and FastME can do branch swapping to find even better trees. The program produces trees very quickly, and this is best demonstrated on large datasets.

Click here to get a large (567 taxon, 2153 character) data file. Save it as 567Tx2153C.nex in a new folder on your local hard drive.

FastME does not analyze sequences directly. Instead we have to give it a distance matrix. We will use PAUP to create the input distance matrix.

Execute the 567Tx2153C.nex file in PAUP. Tell PAUP to use the HKY distance and then export the data file in a format that FastME can read using the command:

savedist format = phylip triangle = both diagonal file=dists.txt ;

While you have PAUP open, do a nj search, score the tree using minimum evolution, ordinary (unweighted) least squares, and weighted least squares.

Conduct a search under the weighted least squares criterion using the command: HSearch NoMulTrees; Score this tree under minimum evolution, ordinary (unweighted) least squares, and weighted least squares and note the amount of time the search took.

Now we will run FastME on the distance matrix:

FastME_MacOS
The program should prompt you for the name of the input file. At this point enter dists.txt and hit return. You should see a menu of options that let you choose what criterion to optimize and which algorithm to use. Conduct a Balanced Generalized Minimum Evolution search using the balanced NNI search. The program exits when it is finished (and it won't take long).

It should produce two output files: dists.txt_stat.txt and dists.txt_tree.txt. Open both in a text editor. To get the tree into PAUP we will have to change the tree file into a NEXUS file. Fortunately all you have to do to accomplish this is add:

#NEXUS
begin trees;
tree fastme = [&U]
to the file before the parenthetical notation, and
end;
to the end of the file.

Execute the 567Tx2153C.nex file and set the distance back to the HKY corrected distance. Use PAUP's GetTrees command to do this. Score this tree under minimum evolution, ordinary (unweighted) least squares, and weighted least squares and note the amount of time the search took.

How did the tree from FastME compare to the trees that you obtained by searching in PAUP?

References cited

  1. Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution 11:605-612.
  2. Felsenstein, J. 1985. Confidence intervals on phylogenies: an approach using the bootstrap. Evolution 39:783-791.