This lab was written by Paul O. Lewis and is being used in BIOL 848 (with slight modifications) by Mark Holder with Paul's permission (the original lab is here). Thanks, Paul!
Lab 2: Introduction to PAUP* and the NEXUS data file format
Introduction to PAUP* 4.0
PAUP* 4.0 is the successor to PAUP 3.1, which was published in 1993 by David L. Swofford (currently at Duke University's Institute for Genome Sciences and Policy and NESCent). The name PAUP means Phylogenetic Analysis Using Parsimony, because parsimony was the only optimality criterion employed at the time. The asterisk in the name PAUP* means "and other methods." PAUP* is one of the most comprehensive phylogenetic analysis computer programs available, and we will use PAUP* to illustrate many of the phylogenetic methods we talk about during the course.
PAUP* Home Page
The PAUP* Home Page is the best place to go for up-to-date information about program availability, known problems/workarounds, and help in the form of a FAQ and electronic forum. As of this writing, PAUP* is being sold by Sinauer Associates (price varies according to platform). While it is not a free program, you really do get a lot for your money compared to most other commercial software, as the next section is designed to illustrate.
What can PAUP* do?
PAUP* is capable of performing many types of phylogenetic analyses. The following listing is not exhaustive, but is designed to give you an idea of what PAUP* can currently do (we'll cover most of these methods later in class):
-
Algorithmic searching: Exhaustive, Branch-and-bound, Stepwise Addition, Neighbor-joining, Puzzling, UPGMA, Star Decomposition
-
Heuristic searching: Nearest-neighbor Interchange (NNI), Subtree Pruning/Regrafting (SPR), Tree Bisection/Reconnection (TBR)
-
Optimality criteria: parsimony, maximum likelihood, minimum-evolution, least-squares
-
Parsimony variants: Camin-Sokal, Wagner, Fitch, transversion, generalized (=weighted)
-
Substitution models: JC69, F81, K80, F84, HKY85, GTR, logdet/paralinear
-
Descriptive statistics: base frequencies, pairwise sequence comparisons
-
Manipulating data scope: include/exclude characters, delete/restore taxa, partitions (characters and taxa)
-
Statistical tests: KH test, homogeneity partition test, permutation tests, base frequency homogeneity test, likelihood ratio test of molecular clock
-
Nodal support measures: jackknife, bootstrap
-
Consensus methods: strict/semistrict/majority-rule/Adams consensus trees, agreement subtrees
-
Trees: generation of random trees, tree-to-tree distances
-
Other: Lake's invariants, plots of gamma distribution, likelihood surface check, ancestral state reconstruction, printing of trees
What can PAUP* not do?
Despite its completeness, there are a few things that PAUP* cannot do for you at the present time:
-
PAUP* does not allow tree editing (like Mesquite, MacClade, or TreeView)
-
PAUP* is not able to do maximum likelihood analyses on amino acid sequences
-
PAUP* does not provide codon models that allow you to take into account the codon structure of protein coding genes when analyzing nucleotide sequences (use PAML or HyPhy for this)
-
PAUP* does not perform Bayesian analyses (we will use MrBayes later on for this)
-
PAUP* (like almost all other phylogenetic analysis programs) assumes your sequences are already aligned; it will not align them for you, nor will it help you find sequences in GenBank or other databases.
Typographical conventions
In this and subsequent web pages, I will try to stick to the following typographical conventions:
-
New terms will look like this
-
Text that I want to emphasize will look like this
-
Command names or portions of commands that you might type into a program such as PAUP* will look like this
-
Keywords used in NEXUS files will look like this
-
Names of files will look like this
PAUP* tips
PAUP* is not finished at this point. For the most part, this is not a problem since you can purchase and use it just like a finished product. The primary drawback of PAUP*'s unfinished status is that there is currently not a complete manual for the program. On the PAUP* Download Page you can find a PDF command summary and "Quick Start" tutorial; however, much of the explanatory portion of the manual is not present in any form. There are easy ways to obtain information from the program itself, however. Some of the tips listed below are concerned with getting the program to tell you what commands and command options are available.
Here are some tips to keep in mind while you use PAUP*. This list is not comprehensive; these are just some things that are not immediately apparent but which make your life easier once you know about them.
-
A command line can be made visible on the Mac version, and is always apparent on all other versions of PAUP*. This may not sound like a tip, but having a command line allows you to explore many of the other tips described below.
-
The help command provides a list of available commands. Often you can spot the command you need by looking at this list. Once you see a command name that looks promising, you can get a description of how to invoke the command like this:
help hsearch;
-
The ? option works for all commands and provides a list of the options for that command as well as the current default settings for those options. This is extremely useful! For example, this command would list all the current likelihood settings:
lset ?;
-
All PAUP* menu commands have command line equivalents. In this class we will only use the text commands rather than using graphical versions of PAUP*. There are significant benefits to using the command line interface. For example, you can put all the commands for an analysis in the data file itself (see section on PAUP blocks, below), allowing you to have a complete record of what you did (often very useful when a reviewer asks you to be more specific about how you performed your analysis!). PAUP blocks are also useful for making sure certain settings are always invoked when you execute the data file. Using the text commands (rather than pointing and clicking) also preserves a record of the commands that you have issued. Sometimes what you tell a program to do is not what you meant to do, so seeing the record of actions is very useful.
-
PAUP* uses the NEXUS data file format. This is a fairly complex file format used by several programs that perform phylogenetic analyses (PAUP*, MacClade, TreeView, and Component, for example). It is described in more detail below, so here I will only point out that PAUP* can put your data in NEXUS format automatically if your data are in one of several recognized formats already. This is done with the toNEXUS command as follows:
toNEXUS fromfile=mydata.txt format=text tofile=mydata.nex;
toNEXUS fromfile=mydata.msf format=gcg tofile=mydata.nex;
toNEXUS fromfile=mydata.dat format=phylip tofile=mydata.nex;
The first of these commands converts a data file (mydata.txt) in plain text format (each sequence on a separate line, with the name first followed by the sequence after one or more blank spaces) to NEXUS format, storing it in a file named mydata.nex. The second line converts mydata.msf (GCG MSF format) into NEXUS format, again storing the resulting file as mydata.nex. The third line converts a PHYLIP formatted data file (mydata.dat) to NEXUS format.
Use the command toNEXUS ? to list other options, including other formats that can be converted.
-
PAUP* allows you to easily include and exclude sites, making it possible to leave primer sites, introns, and dubiously aligned regions in the data file even though you do not wish to include them in analyses. You can also include or exclude entire classes of sites using the keywords all, gapped, missambig, constant, and uninf. For example,
exclude gapped;
would exclude all sites containing a gap for at least one taxon (sequence). If you needed to exclude only 3rd. codon position sites, even this is easy: assuming that the first nucleotide site in each sequence corresponds to a 1st codon position, this command would exclude all the 3rd. position sites (the dot stands for the last nucleotide site in the sequence):
exclude 3-. \ 3;
This is how you include again all the sites you have excluded:
include all;
-
There are parallel commands for deleting and restoring taxa. The term taxon (plural taxa) refers to whatever forms the terminal nodes of your phylogeny (the taxa will be individual sequences in most cases for purposes of this course). Don't be confused by the command names delete and restore: these act just like exclude and include except they act on taxa and not characters (=sites). If the first, second and fourth taxa were named Thermus, Sulfolobus and Pyrococcus, you could tell PAUP* to ignore them in subsequent analyses using either of the two commands below:
delete 1 2 4;
delete Thermus Sulfolobus Pyrococcus;
This is how you would reinstate the taxa deleted above:
restore all;
-
For long runs, PAUP* reports progress once per minute in the form of a line written to the output buffer. To have PAUP* report once every two minutes, specify 120 seconds instead of the default of 60 seconds:
set dstatus=120;
-
By default, PAUP* does not save the output that is generated to a file. The output is stored in what is known as an output buffer. When this buffer becomes full, the first part will begin to be overwritten by newer output. Thus, one of the first things you should do when starting any serious analysis is to start a log file, using either the menu command or a command similar to this:
log file=myoutput.txt start replace;
The keyword start means to start logging output, whereas stop means to stop logging output, and the replace keyword causes the file to be overwritten without warning if it exists. Using append instead of replace is safer: in this case, new output is added at the end of the file and none of the data already in the file is lost. If you do not use either replace or append, then PAUP* will ask you what you want to do (ok if you are sitting there watching it, but not so good if you have started a run and walked away from the computer for a long time!).
-
PAUP* almost always produces unrooted trees, even though the trees look rooted when PAUP* draws them! You can reroot the tree by specifying an outgroup taxon (or taxa); however, this does not change the fact that the analysis did not estimate or determine the root (you did this, either implicitly or explicitly). Here's how to tell PAUP* to always draw trees with Giardia as the outgroup:
outgroup Giardia;
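To make the range notation in the exclude tip above concrete, here is a minimal Python sketch of which sites a specification such as 3-. \ 3 selects. The function name expand_sites and the alignment length of 12 are hypothetical, chosen only for illustration; this is not how PAUP* itself is implemented.

```python
def expand_sites(first, last, step=1):
    """Return the 1-based site numbers selected by a range 'first-last \\ step'."""
    return list(range(first, last + 1, step))


# Hypothetical alignment of 12 sites; in PAUP* the dot stands for this last site.
nchar = 12

# Equivalent of the range "3-. \ 3": every third site, starting at site 3.
third_positions = expand_sites(3, nchar, 3)
print(third_positions)  # [3, 6, 9, 12]
```

Starting the range at 1 or 2 instead of 3 would select the 1st or 2nd codon positions in the same way.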
The NEXUS Data File Format
NEXUS blocks
PAUP* uses a data file format known as NEXUS. This file format is now shared among several programs. NEXUS data files always begin with the characters #NEXUS but are otherwise organized into major units known as blocks. Some blocks are recognized by most of the programs using the NEXUS file format, whereas other blocks are private blocks (recognized by only one program). A NEXUS block has the following basic structure:
#NEXUS
...
begin characters;
...
end;
Note that the ellipsis (...) is never used in a NEXUS data file; it is used here simply to indicate that some text has been omitted. The name of the NEXUS block used as an example above is characters. Because NEXUS data files are organized in named blocks, PAUP* and other programs are able to read blocks whose names they recognize and ignore blocks that are not recognized. This allows many different programs to use the same overall format without crashing when they encounter data they cannot interpret.
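The idea of reading recognized blocks and skipping the rest can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each begin command starts on its own line, no comment contains the word begin, and the KNOWN set and the block name mystuff are made up for the example.

```python
import re


def block_names(nexus_text):
    """Return the names of all blocks in the file, in order, lower-cased."""
    found = re.findall(r"(?im)^\s*begin\s+(\w+)\s*;", nexus_text)
    return [name.lower() for name in found]


example = """#NEXUS
begin taxa;
end;
begin characters;
end;
begin mystuff;
end;
"""

# Illustrative list of block names that a hypothetical program understands.
KNOWN = {"taxa", "characters", "data", "trees", "sets", "assumptions", "paup"}

names = block_names(example)
readable = [name for name in names if name in KNOWN]
print(names)     # ['taxa', 'characters', 'mystuff']
print(readable)  # ['taxa', 'characters']
```

The unrecognized block is simply skipped rather than treated as an error, which is the behavior described above.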
NEXUS commands
Blocks are in turn organized into semicolon-terminated commands.
It is very important that you remember to terminate all commands with a semicolon. This is especially hard to remember for very long commands. PAUP* is pretty good about pointing out forgotten semicolons, but sometimes it doesn't realize you've left something out until some distance downstream, which can make the problem point difficult to find. Some common commands will be provided below in the description of the common blocks.
NEXUS comments
Comments can be placed in a NEXUS file using square brackets. Comments can be placed anywhere, and they are used for many purposes. For example, you can effectively remove some of your data by commenting it out. You can also annotate your sequences using comments. For example, a comment like that below is useful for locating specific sites in your alignment:
[----+--10|----+--20|----+--30|----+--40|----+--50|----]
Ephedra TTAAGCCATGCATGTCTAAGTATGAACTAATT-CAAACGGTGAAACTGCGGATG
Gnetum TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG
Welwitschia TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG
If you would like your comment printed out in the output when PAUP* executes the data file, just insert an exclamation point (!) as the first character inside the opening left square bracket:
[!This is the data file used for my dissertation]
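As a rough illustration of how a program might treat comments, the Python sketch below removes bracketed comments and separately collects the "printed" comments that begin with an exclamation point. It assumes that comments may be nested and that no quoted taxon name contains a square bracket; both function names are hypothetical.

```python
import re


def strip_comments(text):
    """Remove [bracketed] comments, tracking nesting depth."""
    kept = []
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    return "".join(kept)


def printed_comments(text):
    """Return the text of every (non-nested) comment that starts with '!'."""
    return re.findall(r"\[!([^\[\]]*)\]", text)


line = "Ephedra [a [nested] note] TTAAG"
print(repr(strip_comments(line)))  # 'Ephedra  TTAAG'
print(printed_comments("[!This is the data file used for my dissertation]"))
# ['This is the data file used for my dissertation']
```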
Commonly-used NEXUS blocks
Here is a list of common NEXUS blocks and the most-common commands within these blocks. For a complete description of the NEXUS file format, take a look at this paper:
Maddison, David R., Swofford, David L. and Maddison, Wayne P. 1997. "NEXUS: an extensible file format for systematic information." Systematic Biology 46: 590-621
Taxa block
The purpose of a Taxa block is to provide names for your taxa (i.e., sequences). You may not use a Taxa block very often, since you can also supply names for your taxa directly in the Data block (see below). Here is an example of a Taxa block.
#NEXUS
...
begin taxa;
dimensions ntax=5;
taxlabels
Giardia
Thermus
Deinococcus
Sulfolobus
Halobacterium
;
end;
Note that there are four commands in this example of a Taxa block. Can you find the terminating semicolon for each of them?
-
the begin command giving the block's name
-
the dimensions command giving the number of taxa
-
the taxlabels command providing the actual taxon labels
-
the end command, telling PAUP* that there are no more commands to process for this block
Data block
The Data block is the workhorse of NEXUS blocks. This is where you place the actual sequence data, and, as mentioned above, this can also be where you define the names of your sequences. Here is an example of a Data block:
#NEXUS
...
begin data;
dimensions ntax=5 nchar=54;
format datatype=dna missing=? gap=-;
matrix
Ephedra TTAAGCCATGCATGTCTAAGTATGAACTAATTCCAAACGGTGAAACTGCGGATG
Gnetum TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG
Welwitschia TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG
Ginkgo TTAAGCCATGCATGTGTAAGTATGAACTCTTTACAGACTGTGAAACTGCGAATG
Pinus TTAAGCCATGCATGTCTAAGTATGAACTAATTGCAGACTGTGAAACTGCGGATG
[----+--10|----+--20|----+--30|----+--40|----+--50|----]
;
end;
Some things to note in this example are:
-
The dimensions command comes first in a Data block, and specifies the number of sequences (taxa; ntax) and number of sites (characters; nchar).
-
The format command tells PAUP* what kind of data follow (dna, rna, protein, or standard), and provides the symbols used for missing data (?) and gaps (-).
-
The matrix command dominates the Data block, providing the sequences themselves (as well as the taxon names). Note the semicolon terminating the matrix command!!!
-
You can use upper or lower case symbols for nucleotides
-
You can place whitespace anywhere except inside a taxon name or keyword (e.g., data type = dna would cause problems because datatype should not have embedded whitespace).
-
If you simply must have a space in one of your taxon names, either use an underscore character in place of the space (e.g., Ginkgo_biloba) or surround the taxon name in single quotes (e.g., 'Ginkgo biloba'). In either case, PAUP* will output the space in its output.
-
One item missing from the format command in the example above but which is quite useful is something known as an equate list. The following format statement will cause all occurrences of T to be changed to C and all occurrences of G to be changed to A as the data are being read into PAUP*:
format datatype=dna missing=? gap=- equate="T=C G=A";
This is like telling PAUP* to do a search-and-replace operation on the sequences before reading them in, except that your original file remains intact. Be careful when using equate, because the replacement is case sensitive (i.e., equate="t=c g=a" would have had no effect if all the nucleotides are represented by upper case letters!).
-
PAUP* recognizes all the standard ambiguity codes (e.g., R for purine, Y for pyrimidine, N for undetermined, etc.).
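The search-and-replace behavior of an equate list described above, including its case sensitivity, can be mimicked with a short Python sketch. The function apply_equate is hypothetical and only models the description given here; it is not PAUP*'s implementation.

```python
def apply_equate(sequence, equate):
    """Apply an equate list such as 'T=C G=A' to a sequence in a single pass."""
    table = {}
    for pair in equate.split():
        old, new = pair.split("=")
        table[ord(old)] = new  # case-sensitive: 'T' and 't' are different keys
    return sequence.translate(table)


print(apply_equate("TTAAGC", "T=C G=A"))  # CCAAAC
print(apply_equate("TTAAGC", "t=c g=a"))  # TTAAGC (lower-case list has no effect)
```

Because the replacement happens in one pass, a T that becomes C is not changed again, and the original string (like the original data file) is left untouched.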
Trees block
A Trees block has the following structure:
#NEXUS
...
begin trees;
translate
1 Ephedra,
2 Gnetum,
3 Welwitschia,
4 Ginkgo,
5 Pinus
;
tree one = [&U] (1,2,(3,(4,5)));
tree two = [&U] (1,3,(5,(2,4)));
end;
Some things to note in this example are:
-
The translate command provides short alternatives to the taxon names, making the tree descriptions shorter (takes up fewer bytes of disk space).
-
The translate command is not necessary, however; it is ok to use the taxon names directly in the tree descriptions.
-
The tree command denotes the start of a tree description, which consists of a tree name (e.g., one and two are used here), followed by an equals sign and then the tree topology in the standard, parenthetical notation (often referred to as the Newick or New Hampshire format).
-
The special comments consisting of an ampersand symbol followed by the letter U tell PAUP* to interpret the tree as being an unrooted tree.
-
Files containing only the #NEXUS plus a trees block are called tree files
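Two things are easy to get wrong when editing tree descriptions by hand: mismatched parentheses and the mapping between translate numbers and taxon names. The Python sketch below checks the first and applies the second. It is a simplified illustration that assumes the tree description contains no branch lengths (any run of digits is treated as a translate number); the function names are hypothetical.

```python
import re


def is_balanced(newick):
    """Return True if every '(' has a matching ')' in the tree description."""
    depth = 0
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def apply_translate(newick, table):
    """Replace each translate number with the corresponding taxon name."""
    return re.sub(r"\d+", lambda m: table[int(m.group())], newick)


translate = {1: "Ephedra", 2: "Gnetum", 3: "Welwitschia", 4: "Ginkgo", 5: "Pinus"}

print(is_balanced("(1,2,(3,(4,5)));"))  # True
print(is_balanced("(1,2,(3,(4,5));"))   # False: one ')' is missing
print(apply_translate("(1,2,(3,(4,5)));", translate))
# (Ephedra,Gnetum,(Welwitschia,(Ginkgo,Pinus)));
```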
Sets block
The only commands you need to know at this point from a sets block are the charset and the taxset commands.
#NEXUS
...
begin sets;
charset trnL_intron = 562-4226;
taxset gnetales = Ephedra Gnetum Welwitschia;
end;
This sets block defines both a set of characters (in this case the sites composing the trnL intron) and a set of taxa (consisting of the three genera in the seed plant order Gnetales: Ephedra, Gnetum and Welwitschia). We could have used the taxon numbers for the taxset definition (e.g., taxset gnetales = 1-3;) but using the actual names is clearer and less prone to error (just think of what might happen if you decided to reorder your sequences!). These definitions may be used in other blocks. A common use is in commands placed inside a paup block (see below) or typed directly at the PAUP* command prompt.
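The warning about taxon numbers can be seen in a small Python sketch: the same numbers point at different taxa once the sequences are reordered, whereas names keep pointing at the same taxa. The reordered list and the helper resolve_numbers are invented for the illustration.

```python
def resolve_numbers(numbers, taxa):
    """Map 1-based taxon numbers to names, given the order of taxa in the file."""
    return [taxa[n - 1] for n in numbers]


original = ["Ephedra", "Gnetum", "Welwitschia", "Ginkgo", "Pinus"]
reordered = ["Ginkgo", "Pinus", "Ephedra", "Gnetum", "Welwitschia"]  # hypothetical new order

gnetales_by_number = [1, 2, 3]  # the equivalent of "taxset gnetales = 1-3;"

print(resolve_numbers(gnetales_by_number, original))
# ['Ephedra', 'Gnetum', 'Welwitschia']
print(resolve_numbers(gnetales_by_number, reordered))
# ['Ginkgo', 'Pinus', 'Ephedra']  (no longer the Gnetales)
```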
Assumptions block
There is only one command I will introduce from the assumptions block (although there are a number of others that exist). The exset command (the word exset stands for exclusion set) is useful for creating a set of characters that are automatically excluded whenever the data file is executed. Given the following block:
#NEXUS
...
begin assumptions;
exset* badsites = 1 5 47-.;
end;
PAUP* would automatically exclude characters (i.e., sites) 1, 5, and 47 through the end of the sequence. It is the asterisk after the exset keyword that denotes this as the default exclusion set. If you left out the asterisk, PAUP* would define the exclusion set but would not automatically exclude these sites as the data file was being executed.
Paup block
Paup blocks provide a way to give PAUP* commands from within a data file itself. Any command you can type at the command prompt or perform using menu commands you can place in the data file. This allows you to specify an entire analysis right in the data file. For any serious analysis, I always run PAUP* using a paup block. That way I know exactly what I did for a given analysis several days or weeks in the future. Paup blocks are also a handy way to perform certain commands every time the data file is executed. For example, you can set up your favorite likelihood substitution model, delete certain taxa or exclude certain sites from a paup block located just after your data block. Here is an example of a typical paup block:
#NEXUS
...
begin paup;
log file=myoutput.txt start replace;
outgroup Ephedra;
set criterion=likelihood;
lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky;
hsearch swap=tbr addseq=random nreps=100 start=stepwise;
describe 1 / plot=phylogram;
savetrees file=mytrees.tre brlens;
log stop;
quit;
end;
Here is what each line does (but don't worry too much about this since we will be talking much more about individual commands later in lab):
-
The log command starts a log file (the file will be called myoutput.txt and will be overwritten if it already exists)
-
The outgroup command specifies that the resulting trees should be rooted between Ephedra and everything else (this just affects the appearance of the tree when drawn)
-
The set command changes the optimality criterion from the default (parsimony) to maximum likelihood
-
The lset command sets up PAUP* so that the HKY85 model will be used (number of substitution rates is 2, empirical base frequencies, rates are homogeneous across sites, estimate the transition/transversion ratio, and use the HKY model rather than the other, similar F84 model)
-
The hsearch command causes PAUP* to conduct 100 heuristic searches (each beginning from a different, random starting tree); each search will start with a stepwise addition tree using random addition of taxa, and this starting tree will be rearranged using the tree bisection/reconnection branch swapping method
-
The describe command produces a depiction of the tree (rooted at the specified outgroup) on the output (and in the log file, since we opened a log file earlier); the tree will be shown as a phylogram, which means branch lengths will appear proportional to the average number of nucleotide substitutions per site that were inferred for that branch.
-
The savetrees command saves the best tree found during the search (this is quite important and easy to forget to do!). The brlens keyword tells PAUP to save branch length information along with the tree topology.
-
The log command stops the logging of output to the file myoutput.txt
-
The quit command causes PAUP* to quit running; if you left out this command, PAUP* would remain running at this point, allowing you to issue other commands
Note that because PAUP* ignores blocks whose names it does not recognize, you can easily "comment out" a paup block by simply adding a character to its name. For example, adding an underscore
#NEXUS
...
begin _paup;
.
.
.
end;
is enough to cause PAUP* to completely ignore this paup block. This is handy because it allows you to create multiple paup blocks for different purposes and turn them off and on whenever you need them.
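If you switch blocks on and off often, the renaming trick can be scripted. The following Python sketch is one possible way to do it, assuming each begin command starts on its own line; the helper names disable_block and enable_block are hypothetical.

```python
import re


def disable_block(nexus_text, name):
    """Rename 'begin name;' to 'begin _name;' so the block is no longer recognized."""
    pattern = r"(?im)^(\s*begin\s+)" + re.escape(name) + r"(\s*;)"
    return re.sub(pattern, lambda m: m.group(1) + "_" + name + m.group(2), nexus_text)


def enable_block(nexus_text, name):
    """Undo disable_block by removing the leading underscore from the block name."""
    pattern = r"(?im)^(\s*begin\s+)_" + re.escape(name) + r"(\s*;)"
    return re.sub(pattern, lambda m: m.group(1) + name + m.group(2), nexus_text)


text = "begin paup;\n  hsearch;\nend;\n"
off = disable_block(text, "paup")
print(off.splitlines()[0])                        # begin _paup;
print(enable_block(off, "paup").splitlines()[0])  # begin paup;
```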
You can also "comment out" a portion of a paup block using the leave command. For example, in this paup block, PAUP* will be set up for doing a likelihood analysis but will not actually conduct the search; the leave command causes PAUP* to exit the block early:
#NEXUS
...
begin paup;
log file=myoutput.txt start replace;
outgroup Ephedra;
set criterion=likelihood;
lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky;
leave;
hsearch swap=tbr addseq=random nreps=100 start=stepwise;
describe 1 / plot=phylogram;
savetrees file=mytrees.tre brlens;
log stop;
quit;
end;
Today's PAUP* lab exercise
First, a note about characters blocks versus data blocks: the characters block is essentially a new and improved version of the data block. Feel free to use either one, but be aware that programs such as PAUP* may eventually stop using the data block since the characters block accomplishes the same thing and has features missing in the data block. To convert a data block to a characters block, just change the block name and add the keyword newtaxa to the dimensions command just before the keyword ntax. This tells PAUP* that you will be defining the names of your taxa in the characters block itself (rather than in a preceding taxa block).
Questions that should be answered (or exercises that you should do on your own) appear in this style. There is no need to turn in your answers to these exercises. It is up to you to make sure you are comfortable with this material. Please ask questions if anything is unclear. While it is possible to do these exercises outside of the scheduled lab time, working through them in lab is better because we are here to help with questions that arise.
-
First create a folder.
-
Copy the angio35.txt file from the data folder into your own newly-created folder. (If you are not in the computer lab, you can download the file by right-clicking here and using your browser's menu option)
-
Open the angio35.txt file in a text editor (not Microsoft Word). Note that the file is not in NEXUS format yet. It contains 35 DNA sequences. These are rbcL gene sequences from various green plants. The important thing to notice is that the format is quite simple: each line consists of a taxon name followed by at least one blank space, which is followed by the sequence for that taxon. Note that the blank space is important: taxon names cannot contain embedded spaces, because spaces are used to separate taxon names from the corresponding sequences.
-
Launch PAUP* from within the directory that contains the angio35.txt file and type in the following command:
toNEXUS from=angio35.txt to=angio35.nex datatype=nucleotide format=text;
After the conversion, the file angio35.nex should be present. Open this NEXUS file in a text editor to see what PAUP* did to convert the original file to NEXUS format. Do not execute the file in PAUP* just yet because there are some additions we need to make before it is ready for analyzing.
-
Create an assumptions block containing a default exclusion set that excludes the following sites automatically whenever the data file is executed. This should be added to the bottom of the newly-created NEXUS file (i.e., after the data). Save the file as plain text.
begin assumptions;
exset * unused = 1-30 2125-2218 2219-2235 2305-2307 5058-5063 4615-4743 5064-5093;
end;
These numbers represent nucleotide sites that either are missing a lot of data or are difficult to align. The name I gave to this exclusion set is unused, but you could name it anything you like. The asterisk tells PAUP* that you want this exset applied (i.e. you want these sites excluded) every time the file is executed.
-
Create a sets block comprising the following four charset commands:
-
The first charset should be named nad4 and include sites 1 through 2235
-
The second charset should be named rps4 and include sites 2236 through 2985
-
The third charset should be named rbcL and include sites 2986 through 4391
-
The fourth charset should be named other and include sites with numbers above 4391
This block should be placed after the assumptions block. Look at the description above of the sets block and try to do this part on your own.
-
Now execute the data file by invoking paup and typing the execute command:
execute angio35.nex ;
If your assumptions block is correct, the output should include a statement saying that 309 characters have been excluded. If you set up your sets block correctly you should be able to enter these commands:
exclude all;
include rbcL;
and get no errors. In addition, PAUP* should tell you that 5093 characters were excluded (as a result of the exclude all command) and 1406 were re-included (as a result of the include rbcL command). For the rest of the exercise, we will be working with the data from the first 3 genes, so re-include the nad4 and rps4 data:
include nad4 rps4;
PAUP* should now say that there are a total of 4391 included characters.
-
The first item of business in starting an analysis in PAUP* is to begin logging the output to a file. The following command will begin saving all output to the file output.txt. Note that we have chosen to automatically replace the file if it already exists. If you are nervous about this (and would rather have PAUP* ask before overwriting an existing file), either leave off the replace keyword or substitute append, which tells PAUP* to simply add new output to the end of the file if it already exists.
log file=output.txt start replace;
-
Type set ? to get a listing of the general settings. PAUP* has four "settings" commands: set for general settings; pset for settings specifically related to parsimony; lset for settings specifically related to likelihood; and dset for settings specifically related to distance methods.
From the output of the set command, can you determine which optimality criterion PAUP* would use if we were to do a search at this point?
-
To perform a parsimony search, first try the alltrees command. This command asks PAUP* to calculate the optimality criterion for every possible tree:
alltrees;
Did PAUP* allow you to perform an exhaustive search for 35 taxa?
-
Now try heuristic searching. This approach does not attempt to look at all possible trees, but instead only examines trees that are in the realm of possibility (which can still be a lot of trees!):
hsearch;
The search progress will be displayed in a dialog box. When the button says Close rather than Stop, take a look at the numbers summarizing this search. What is the parsimony score of the best tree found during the search? (Write down this score somewhere for later reference.) How many trees were examined (look at # Rearrangements tried)?
-
Now you probably want to take a look at the tree that PAUP* found and is now holding in memory. First, however, choose an outgroup taxon so that the (unrooted) tree will be drawn in a way that looks like it is rooted in a reasonable place:
outgroup Funaria_hygrometrica ;
showtree;
To make the tree appear to flow downward, which is more pleasing to the eye, tell PAUP* that you would like to use the tree order "right" (this is also commonly known as "ladderizing right"):
set torder=right;
showtree;
Before doing anything else, we should save this tree in a file so that it will be available later, perhaps for viewing or printing in TreeView. Let's call the treefile pars.tre. The brlens keyword in the command below tells PAUP* that you want to save the branch lengths as well as just the tree topology (almost always a good option to include):
savetrees file=pars.tre brlens;
-
You may have noticed that PAUP* found 10 most-parsimonious trees. These 10 trees are all indistinguishable using the parsimony criterion. Let's now use the likelihood criterion to evaluate these 10 trees:
set criterion=likelihood;
lscores all;
These commands ask PAUP* to simply evaluate the likelihood score of the trees in memory. Note that because we arrived at these trees using parsimony, it is quite possible that none of these trees represents the maximum likelihood tree. That is, we may be able to find better trees under the likelihood criterion if we performed a search using the likelihood criterion. What is the likelihood score of the best tree? (As for parsimony, write this number down for later comparison.) Is the likelihood score the same for all 10 trees? Which tree is best? Important: PAUP* reports the negative of the natural logarithm of the likelihood score. This means that smaller numbers are better, as smaller numbers represent higher likelihoods.
-
Next, we will obtain a neighbor-joining tree. Neighbor-joining (NJ for short) is one of the algorithmic methods: that is, it uses an optimality criterion (the minimum evolution criterion) at each step of the algorithm, but in the end produces a tree without actually examining many trees:
nj;
-
Let's see how the NJ tree compares to the tree found by parsimony. First, use the lscores command to compute the log-likelihood of the NJ tree:
lscores all;
Now compute the parsimony score of the NJ tree using the pscores command:
pscores all;
According to the parsimony criterion, is the NJ tree better than any of the trees found by parsimony? According to the likelihood criterion, is the NJ tree better than the best tree you have found thus far?
-
That's all for today in PAUP:
log stop;
quit;
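The character counts quoted in the exercise (309 excluded sites, 1406 rbcL sites, 4391 sites in the first three genes) can be double-checked with a few lines of Python. This is just an arithmetic check on the site ranges given above; parse_ranges is a hypothetical helper that understands only plain numbers and first-last ranges.

```python
def parse_ranges(spec):
    """Expand a specification such as '1-30 2305-2307' into a set of site numbers."""
    sites = set()
    for item in spec.split():
        first, _, last = item.partition("-")
        sites.update(range(int(first), int(last or first) + 1))
    return sites


unused = parse_ranges(
    "1-30 2125-2218 2219-2235 2305-2307 5058-5063 4615-4743 5064-5093"
)
nad4 = parse_ranges("1-2235")
rps4 = parse_ranges("2236-2985")
rbcL = parse_ranges("2986-4391")

print(len(unused))              # 309
print(len(rbcL))                # 1406
print(len(nad4 | rps4 | rbcL))  # 4391
```

If the numbers PAUP* reports differ from these, a typo in one of the ranges in your assumptions or sets block is the most likely cause.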
Now we'll look at Mesquite a bit.
Mesquite
Mesquite is a graphical system for the visualization and analysis of comparative data in a phylogenetic context. The home page is mesquiteproject.org
Wayne and David Maddison are the lead authors, but Mesquite was designed to be modular and encourage community input (in fact one of the leading contributors is Dr. Peter Midford, who is currently at KU). Mesquite is free, open-source and written in Java so that it should run on any modern computer. There are some analyses of character data that can only be performed in Mesquite, and a large set of features that are available in other tools such as MacClade. I strongly encourage you to look at the feature list on the Mesquite features page to get a feeling for the types of analyses that Mesquite can perform.
One of Mesquite's great strengths is that it serves as a user-friendly editor for phylogenetic data. The underlying file format that Mesquite uses is the NEXUS format that is used by PAUP* and MrBayes. So you can use Mesquite as a way of getting your data ready for phylogenetic analysis (unfortunately, every program tends to speak a slightly different dialect of NEXUS, so some hand-editing is often required).
Because Mesquite is a graphical program, describing in detail how to perform actions in it is very tedious. Rather than say:
"Go to the Taxa&Trees menu. Choose the New Tree Window. Click on the Default Trees choice. Hit the OK button."
I will use an abbreviated syntax like this:
Taxa&Trees >> New Tree Window >> Default Trees
Note that you cannot tell by looking at that instruction which steps are menu options and which are options in a dialog box. I will also omit the "click OK" step. If you start by looking for the command in Mesquite's menus and follow each step of the instruction in order, it should be fairly easy to interpret these instructions.
-
Download Mesquite and launch it by double-clicking on the Mesquite icon.
-
Copy the file angio35.txt to original-angio35.txt so that if Mesquite modifies the file in an unexpected way we can get back the original data.
-
Open the file using File >> Open File...
-
You should see the "project view" of the file, which has icons for various views of the data. For instance, you can manipulate data about the taxa by clicking on the "Taxa/List & Manage Taxa" button. Look at the character matrix by clicking on the "Show Matrix" button. Familiarize yourself with the matrix editing tools that show up in the toolbar next to the matrix. When you mouse over a button in the toolbar, a description of the tool should appear in the status box at the bottom of the window.
Note that there are 2 toolbars: one to the left of the matrix and one under the taxa labels.
Can you add an annotation saying "I don't trust this basecall" to the cell that corresponds to the tenth character and the 11th taxon?
-
Go back to the Project tab and look at the "List & Manage Characters" view.
Can you tell which characters were excluded in the exset of the file?
-
We created some trees in PAUP*, but those were saved to another file. Let's see if we can tell Mesquite that the trees are associated with this data. Use:
Taxa&Trees >> Import File With Trees >> Link Contents... and then browse to the pars.tre file. This should result in a Trees panel in the project tab. By selecting "Show Trees" you should be able to examine the trees in a new window.
-
Use Analysis >> Values for Current Tree... >> Treelength to make the tree window display the number of steps required to explain the tree.
Does the tree length agree with the parsimony score in PAUP? If not, why not?
-
Is the tree rooted on the outgroup (Funaria hygrometrica)? If not, why not?
Get familiar with the tools in the tree window's tool bar. Find the rerooting tool and use it to redraw the tree such that Funaria hygrometrica is the outgroup.
Did the tree length change when you rerooted the tree?
-
To look at a most parsimonious reconstruction of the character states, we can use Analysis >> Trace Character History >> Parsimony Ancestral Character States . The tree should now be colored (something other than black). The legend in the bottom corner of the window shows how the colors map to the states, what character is being traced, and the number of steps required to explain this character. Find a variable character and look at several rerootings of the tree.
Does the tree length for the character ever change?
Find a character with homoplasy in it and use the "Move Branch" tool to rearrange the tree so that there is no homoplasy in that character.
What happened to the total tree length when you made the changes?
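The per-character step count shown in the trace legend can be checked by hand on small cases. Below is a minimal sketch of Fitch's algorithm for a single unordered character, written in Python rather than taken from Mesquite. The taxon names t1 to t4, their states, and the nested-tuple tree encoding are all made up for illustration.

```python
def fitch_steps(tree, states):
    """Count parsimony steps for one unordered character (Fitch's algorithm).

    tree   : a taxon name (str) or a 2-tuple of subtrees (rooted, binary)
    states : dict mapping taxon name -> observed state
    Returns (possible_states_at_this_node, steps_in_this_subtree).
    """
    if isinstance(tree, str):
        # A tip: its state set is just the observed state, no steps needed.
        return {states[tree]}, 0

    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)

    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change here.
        return shared, left_steps + right_steps
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_steps + right_steps + 1


# Hypothetical character: two taxa have A, two have G.
states = {"t1": "A", "t2": "A", "t3": "G", "t4": "G"}

tree_without_homoplasy = (("t1", "t2"), ("t3", "t4"))
tree_with_homoplasy = (("t1", "t3"), ("t2", "t4"))

print(fitch_steps(tree_without_homoplasy, states)[1])  # 1
print(fitch_steps(tree_with_homoplasy, states)[1])     # 2
```

On the first tree the character changes once, the minimum possible for a two-state character. On the second tree the same data need two changes, which is what homoplasy in that character looks like.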
- Quit Mesquite and save the changes to the file (you are welcome to continue to check out Mesquite, but we are done with it for today's lab).
- Open angio35.nex and look at the changes that Mesquite made.
Can you find the annotations that you made about the cell of the character matrix? Do you think that these changes will interfere with PAUP reading the file? Were the trees from the parsimony analysis included in the file?
- Start PAUP* again. Execute the angio35.nex file and then use PAUP*'s GetTrees command to get the trees from the pars.tre file. Calculate the parsimony scores of the trees (if you changed them in Mesquite, they will not all have the same score as they did before you ran Mesquite).