BIOL 848: Phylogenetic Methods

Instructor: Mark Holder

This lab was written by Paul O. Lewis and is being used in BIOL 848 (with slight modifications) by Mark Holder with Paul's permission (the original lab is here). Thanks, Paul!

Lab 2: Introduction to PAUP* and the NEXUS data file format

Introduction to PAUP* 4.0

PAUP* 4.0 is the successor to PAUP 3.1, which was published in 1993 by David L. Swofford (currently at Duke University's Institute for Genome Sciences and Policy and NESCent). The name PAUP stands for Phylogenetic Analysis Using Parsimony, because parsimony was the only optimality criterion employed at the time. The asterisk in the name PAUP* means "and other methods". PAUP* is one of the most comprehensive phylogenetic analysis programs available, and we will use PAUP* to illustrate many of the phylogenetic methods we talk about during the course.

PAUP* Home Page

The PAUP* Home Page is the best place to go for up-to-date information about program availability, known problems/workarounds, and help in the form of a FAQ and electronic forum. As of this writing, PAUP* is being sold by Sinauer Associates (price varies according to platform). While it is not a free program, you really do get a lot for your money compared to most other commercial software, as the next section is designed to illustrate.

What can PAUP* do?

PAUP* is capable of performing many types of phylogenetic analyses. The following list is not exhaustive, but is designed to give you an idea of what PAUP* can currently do (we'll cover most of these methods later in class):

What can PAUP* not do?

Despite its completeness, there are a few things that PAUP* cannot do for you at the present time:

Typographical conventions

In this and subsequent web pages, I will try to stick to the following typographical conventions:

PAUP* tips

PAUP* is not finished at this point. For the most part, this is not a problem since you can purchase and use it just like a finished product. The primary drawback of PAUP*'s unfinished status is that there is currently not a complete manual for the program. On the PAUP* Download Page you can find a PDF command summary and "Quick Start" tutorial; however, much of the explanatory portion of the manual is not present in any form. There are easy ways to obtain information from the program itself, however. Some of the tips listed below are concerned with getting the program to tell you what commands and command options are available.

Here are some tips to keep in mind while you use PAUP*. This list is not comprehensive; these are just some things that are not immediately apparent but which make your life easier once you know about them.

The NEXUS Data File Format

NEXUS blocks

PAUP* uses a data file format known as NEXUS. This file format is now shared among several programs. NEXUS data files always begin with the characters #NEXUS but are otherwise organized into major units known as blocks. Some blocks are recognized by most of the programs using the NEXUS file format, whereas other blocks are private blocks (recognized by only one program). A NEXUS block has the following basic structure:
#NEXUS
...
begin characters;
...
end;
Note that the ellipsis (...) is never used in a NEXUS data file; it is used here simply to indicate that some text has been omitted. The name of the NEXUS block used as an example above is characters. Because NEXUS data files are organized in named blocks, PAUP* and other programs are able to read blocks whose names they recognize and ignore blocks that are not recognized. This allows many different programs to use the same overall format without crashing when they encounter data they cannot interpret.
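
To make the idea of named blocks concrete, here is a minimal Python sketch of how a program might list the blocks in a NEXUS file and decide which ones to skip. The function block_names, the set of recognized names and the private block called notes are all hypothetical, made up for illustration; this is not how PAUP* itself is implemented, and the sketch makes no attempt to handle a begin command hidden inside a comment.

```python
import re

def block_names(nexus_text):
    """Return the names of the blocks in a NEXUS file, in order of appearance."""
    if not nexus_text.lstrip().upper().startswith("#NEXUS"):
        raise ValueError("a NEXUS file must begin with #NEXUS")
    # Block names are normalized to lower case before comparison.
    found = re.findall(r"\bbegin\s+(\w+)\s*;", nexus_text, flags=re.IGNORECASE)
    return [name.lower() for name in found]

example = """#NEXUS
begin taxa;
  dimensions ntax=2;
  taxlabels Ephedra Gnetum;
end;
begin notes;
  text something only one program understands;
end;
"""

# Hypothetical list of blocks that our imaginary program knows how to read.
recognized = {"taxa", "characters", "data", "trees", "sets", "assumptions", "paup"}

names = block_names(example)
skipped = [name for name in names if name not in recognized]
print("blocks found:  ", names)
print("blocks skipped:", skipped)
```

The point of the design is that an unrecognized block is simply passed over rather than treated as an error.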

NEXUS commands

Blocks are in turn organized into semicolon-terminated commands. It is very important that you remember to terminate all commands with a semicolon. This is especially hard to remember for very long commands. PAUP* is pretty good about pointing out forgotten semicolons, but sometimes it doesn't realize you've left something out until some distance downstream, which can make the problem point difficult to find. Some common commands will be provided below in the description of the common blocks.
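
To see why a forgotten semicolon can be hard to track down, consider this minimal Python sketch, which splits the body of a block into commands purely on semicolons. The function split_commands is hypothetical, and it assumes no semicolons occur inside comments or quoted names; it only illustrates the idea and is not PAUP*'s actual parser.

```python
def split_commands(block_body):
    """Split the text inside a block into semicolon-terminated commands.

    Returns (commands, leftover), where leftover is any text found after
    the last semicolon (i.e., an unterminated command).
    """
    pieces = block_body.split(";")
    # Collapse runs of whitespace so multi-line commands print on one line.
    commands = [" ".join(piece.split()) for piece in pieces[:-1]]
    leftover = " ".join(pieces[-1].split())
    return commands, leftover

# The semicolon after the format command has been left out on purpose.
body = """
  dimensions ntax=2 nchar=4;
  format datatype=dna
  matrix A ACGT B ACGT;
"""

commands, leftover = split_commands(body)
for command in commands:
    print(repr(command))
```

Because the semicolon is missing, the format and matrix commands are read as one long command, so the trouble only becomes apparent some distance downstream of the actual mistake.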

NEXUS comments

Comments can be placed in a NEXUS file using square brackets. Comments can be placed anywhere, and they are used for many purposes. For example, you can effectively remove some of your data by commenting it out. You can also annotate your sequences using comments. For example, a comment like that below is useful for locating specific sites in your alignment:

            [----+--10|----+--20|----+--30|----+--40|----+--50|----]
Ephedra      TTAAGCCATGCATGTCTAAGTATGAACTAATT-CAAACGGTGAAACTGCGGATG
Gnetum       TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG
Welwitschia  TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG
If you would like your comment printed out in the output when PAUP* executes the data file, just insert an exclamation point (!) as the first character inside the opening left square bracket:

[!This is the data file used for my dissertation]
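
Here is a minimal Python sketch of how comments might be handled by a program reading a NEXUS file: everything between square brackets is dropped, and comments beginning with an exclamation point are collected for printing. The function strip_comments is hypothetical, and allowing comments to nest inside one another is an assumption of this sketch rather than something stated above.

```python
def strip_comments(text):
    """Remove [bracketed] comments; return (stripped_text, printed_comments)."""
    kept = []        # characters outside any comment
    printed = []     # bodies of [!...] comments, without the "!"
    current = []     # characters of the comment being read
    depth = 0        # how many unclosed "[" we are inside
    for ch in text:
        if ch == "[":
            depth += 1
            if depth == 1:
                current = []
                continue
        elif ch == "]" and depth > 0:
            depth -= 1
            if depth == 0:
                comment = "".join(current)
                if comment.startswith("!"):
                    printed.append(comment[1:])
                continue
        if depth > 0:
            current.append(ch)
        else:
            kept.append(ch)
    if depth != 0:
        raise ValueError("comment opened with '[' but never closed")
    return "".join(kept), printed

text = "[!This is the data file used for my dissertation]\nEphedra TTAA[site 5]GCC\n"
stripped, printed = strip_comments(text)
print(stripped)
print(printed)
```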

Commonly-used NEXUS blocks

Here is a list of common NEXUS blocks and the most common commands within these blocks. For a complete description of the NEXUS file format, take a look at this paper:

Maddison, David R., Swofford, David L. and Maddison, Wayne P. 1997. "NEXUS: an extensible file format for systematic information." Systematic Biology 46: 590-621

Taxa block

The purpose of a Taxa block is to provide names for your taxa (i.e., sequences). You may not use a Taxa block very often, since you can also supply names for your taxa directly in the Data block (see below). Here is an example of a Taxa block.
#NEXUS
...
begin taxa;
  dimensions ntax=5;
  taxlabels 
    Giardia
    Thermus
    Deinococcus
    Sulfolobus
    Halobacterium
  ;
end;
Note that there are four commands in this example of a Taxa block. Can you find the terminating semicolon for each of them?

Data block

The Data block is the workhorse of NEXUS blocks. This is where you place the actual sequence data, and, as mentioned above, this can also be where you define the names of your sequences. Here is an example of a Data block:
#NEXUS
...
begin data;
  dimensions ntax=5 nchar=54;
  format datatype=dna missing=? gap=-;
  matrix
    Ephedra       TTAAGCCATGCATGTCTAAGTATGAACTAATTCCAAACGGTGAAACTGCGGATG
    Gnetum        TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG
    Welwitschia   TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG
    Ginkgo        TTAAGCCATGCATGTGTAAGTATGAACTCTTTACAGACTGTGAAACTGCGAATG
    Pinus         TTAAGCCATGCATGTCTAAGTATGAACTAATTGCAGACTGTGAAACTGCGGATG
                 [----+--10|----+--20|----+--30|----+--40|----+--50|----]
  ;
end;

Some things to note in this example are:
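
One point that can be checked mechanically is that the dimensions command agrees with the matrix: ntax must equal the number of rows, and nchar must equal the length of every sequence. The minimal Python sketch below builds a data block from name/sequence pairs and computes both numbers from the data. The function make_data_block is hypothetical (it is not part of PAUP*), and the format line is simply copied from the example above.

```python
def make_data_block(rows):
    """Turn a list of (taxon_name, sequence) pairs into the text of a data block."""
    lengths = {len(sequence) for _, sequence in rows}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    for name, _ in rows:
        if len(name.split()) != 1:
            raise ValueError(f"taxon name {name!r} must not contain spaces")
    nchar = lengths.pop()
    width = max(len(name) for name, _ in rows) + 2   # pad names into a column
    lines = [
        "begin data;",
        f"  dimensions ntax={len(rows)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"    {name.ljust(width)}{sequence}" for name, sequence in rows]
    lines += ["  ;", "end;"]
    return "\n".join(lines)

rows = [
    ("Gnetum",      "TTAAGCCATGCATGTCTATGTACGAACTAATC-AGAACGGTGAAACTGCGGATG"),
    ("Welwitschia", "TTAAGCCATGCACGTGTAAGTATGAACTAGTC-GAAACGGTGAAACTGCGGATG"),
]
block = make_data_block(rows)
print(block)
```

Computing ntax and nchar from the rows, rather than typing them by hand, removes one common source of errors when a data file is edited.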

Trees block

A Trees block has the following structure:
#NEXUS
...
begin trees;
  translate
    1 Ephedra,
    2 Gnetum,
    3 Welwitschia,
    4 Ginkgo,
    5 Pinus
  ;
  tree one = [&U] (1,2,(3,(4,5)));
  tree two = [&U] (1,3,(5,(2,4)));
end;

Some things to note in this example are:
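
One thing that is easy to get wrong when typing a tree description by hand is the parentheses: every opening parenthesis needs a matching closing one. Here is a minimal Python sketch of such a check. The function check_tree is hypothetical, and it looks only at the parentheses, not at the taxon numbers or commas.

```python
def check_tree(description):
    """Return True if the parentheses balance; raise ValueError otherwise."""
    depth = 0
    for position, ch in enumerate(description):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at position {position}")
    if depth != 0:
        raise ValueError(f"{depth} '(' left unclosed")
    return True

print(check_tree("(1,2,(3,(4,5)));"))   # balanced: three '(' and three ')'

try:
    check_tree("(1,2,(3,(4,5));")       # one ')' short
except ValueError as err:
    print("problem:", err)
```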

Sets block

The only commands you need to know at this point from a sets block are the charset and the taxset commands.
#NEXUS
...
begin sets;
  charset trnL_intron = 562-4226;
  taxset gnetales = Ephedra Gnetum Welwitschia;
end;

This sets block defines both a set of characters (in this case the sites composing the trnL intron) and a set of taxa (consisting of the three genera in the seed plant order Gnetales: Ephedra, Gnetum and Welwitschia). We could have used the taxon numbers for the taxset definition (e.g., taxset gnetales = 1-3;) but using the actual names is clearer and less prone to error (just think of what might happen if you decided to reorder your sequences!). These definitions may be used in other blocks. A common use is in commands placed inside a paup block (see below) or typed directly at the PAUP* command prompt.

Assumptions block

There is only one command I will introduce from the assumptions block (although there are a number of others that exist). The exset command (the word exset stands for exclusion set) is useful for creating a set of characters that are automatically excluded whenever the data file is executed. Given the following block:
#NEXUS
...
begin assumptions;
  exset* badsites = 1 5 47-.;
end;

PAUP* would automatically exclude characters (i.e., sites) 1, 5, and 47 through the end of the sequence. It is the asterisk after the command name exset that denotes this as the default exclusion set. If you left out the asterisk, PAUP* would define the exclusion set but would not automatically exclude these sites as the data file was being executed.
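
To make the meaning of a specification like 1 5 47-. explicit, here is a minimal Python sketch that expands it into a list of site numbers, treating the period as "the last character". The function expand_set is hypothetical, and it handles only single sites and simple first-last ranges written without spaces around the dash; other forms that NEXUS may allow are not covered.

```python
def expand_set(spec, nchar):
    """Expand a specification such as '1 5 47-.' into a sorted list of sites."""
    sites = set()
    for token in spec.split():
        first, dash, last = token.partition("-")
        start = nchar if first == "." else int(first)
        if not dash:
            stop = start                      # a single site, e.g. "5"
        else:
            stop = nchar if last == "." else int(last)
        if not 1 <= start <= stop <= nchar:
            raise ValueError(f"bad range {token!r} for nchar={nchar}")
        sites.update(range(start, stop + 1))
    return sorted(sites)

# With 54 sites, as in the data block example, 47-. means sites 47 through 54.
excluded = expand_set("1 5 47-.", nchar=54)
print(len(excluded), "sites excluded:", excluded)
```

With nchar=54 this gives 1 + 1 + 8 = 10 excluded sites.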

Paup block

Paup blocks provide a way to give PAUP* commands from within a data file itself. Any command you can type at the command prompt or perform using menu commands you can place in the data file. This allows you to specify an entire analysis right in the data file. For any serious analysis, I always run PAUP* using a paup block. That way I know exactly what I did for a given analysis several days or weeks in the future. Paup blocks are also a handy way to perform certain commands every time the data file is executed. For example, you can set up your favorite likelihood substitution model, delete certain taxa or exclude certain sites from a paup block located just after your data block. Here is an example of a typical paup block:
#NEXUS
...
begin paup;
  log file=myoutput.txt start replace;
  outgroup Ephedra;
  set criterion=likelihood;
  lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky;
  hsearch swap=tbr addseq=random nreps=100 start=stepwise; 
  describe 1 / plot=phylogram;
  savetrees file=mytrees.tre brlens;
  log stop;
  quit;  
end;

Here is what each line does (but don't worry too much about this since we will be talking much more about individual commands later in lab):

Note that because PAUP* ignores blocks whose names it does not recognize, you can easily "comment out" a paup block by simply adding a character to its name. For example, adding an underscore

#NEXUS
...
begin _paup;
.
.
.
end;

is enough to cause PAUP* to completely ignore this paup block. This is handy because it allows you to create multiple paup blocks for different purposes and turn them off and on whenever you need them.

You can also "comment out" a portion of a paup block using the leave command. For example, in this paup block, PAUP* will be set up for doing a likelihood analysis but will not actually conduct the search; the leave command causes PAUP* to exit the block early:

#NEXUS
...
begin paup;
  log file=myoutput.txt start replace;
  outgroup Ephedra;
  set criterion=likelihood;
  lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky; 
  leave;
  hsearch swap=tbr addseq=random nreps=100 start=stepwise; 
  describe 1 / plot=phylogram;
  savetrees file=mytrees.tre brlens;
  log stop;
  quit;  
end;

Today's PAUP* lab exercise

First, a note about characters blocks versus data blocks: the characters block is essentially a new and improved version of the data block. Feel free to use either one, but be aware that programs such as PAUP* may eventually stop using the data block since the characters block accomplishes the same thing and has features missing in the data block. To convert a data block to a characters block, just change the block name and add the keyword newtaxa to the dimensions command just before the keyword ntax. This tells PAUP* that you will be defining the names of your taxa in the characters block itself (rather than in a preceding taxa block).

Questions that should be answered (or exercises that you should do on your own) appear in this style. There is no need to turn in your answers to these exercises. It is up to you to make sure you are comfortable with this material. Please ask questions if anything is unclear. While it is possible to do these exercises outside of the scheduled lab time, working through them in lab is better because we are here to help with questions that arise.

  1. First create a folder.
  2. Copy the angio35.txt file from the data folder into your own newly-created folder. (If you are not in the computer lab, you can download the file by right-clicking here and using your browser's Save Target As... menu option)
  3. Open the angio35.txt file in a text editor (not Microsoft Word). Note that the file is not in NEXUS format yet. It contains 35 DNA sequences. These are rbcL gene sequences from various green plants. The important thing to notice is that the format is quite simple: each line consists of a taxon name followed by at least one blank space, which is followed by the sequence for that taxon. Note that the blank space is important: taxon names cannot contain embedded spaces, because spaces are used to separate taxon names from the corresponding sequences.
  4. Launch PAUP* from within the directory that contains the angio35.txt file and type in the following command:
    toNEXUS from=angio35.txt to=angio35.nex datatype=nucleotide format=text;

    After the conversion, the file angio35.nex should be present. Open this NEXUS file in a text editor to see what PAUP* did to convert the original file to NEXUS format. Do not execute the file in PAUP* just yet because there are some additions we need to make before it is ready for analyzing.

  5. Create an assumptions block containing a default exclusion set that excludes the following sites automatically whenever the data file is executed. This should be added to the bottom of the newly-created NEXUS file (i.e., after the data). Save the file as plain text.
    begin assumptions;
      exset * unused = 1-30 2125-2218 2219-2235 2305-2307 5058-5063 4615-4743 5064-5093;
    end;

    These numbers represent nucleotide sites that either are missing a lot of data or are difficult to align. The name I gave to this exclusion set is unused, but you could name it anything you like. The asterisk tells PAUP* that you want this exset applied (i.e. you want these sites excluded) every time the file is executed.

  6. Create a sets block comprising the following four charset commands:
    • The first charset should be named nad4 and include sites 1 through 2235
    • The second charset should be named rps4 and include sites 2236 through 2985
    • The third charset should be named rbcL and include sites 2986 through 4391
    • The fourth charset should be named other and include sites with numbers above 4391

    This block should be placed after the assumptions block. Look at the description above of the sets block and try to do this part on your own.

  7. Now execute the data file by invoking paup and typing the execute command:
    execute angio35.nex ;
    If your assumptions block is correct, the output should include a statement saying that 309 characters have been excluded. If you set up your sets block correctly you should be able to enter these commands:
    exclude all;
    include rbcL;

    and get no errors. In addition, PAUP* should tell you that 5093 characters were excluded (as a result of the exclude all command) and 1406 were re-included (as a result of the include rbcL command). For the rest of the exercise, we will be working with the data from the first 3 genes, so re-include the nad4 and rps4 data:

    include nad4 rps4;

    PAUP* should now say that there are a total of 4391 included characters.

  8. The first item of business in starting an analysis in PAUP* is to begin logging the output to a file. The following command will begin saving all output to the file output.txt. Note that we have chosen to automatically replace the file if it already exists. If you are nervous about this (and would rather have PAUP* ask before overwriting an existing file), either leave off the replace keyword or substitute append, which tells PAUP* to simply add new output to the end of the file if it already exists.
    log file=output.txt start replace;
  9. Type set ? to get a listing of the general settings. PAUP* has four "settings" commands: set for general settings; pset for settings specifically related to parsimony; lset for settings specifically related to likelihood; and dset for settings specifically related to distance methods.

    From the output of the set command, can you determine which optimality criterion PAUP* would use if we were to do a search at this point?

  10. To perform a parsimony search, first try the alltrees command. This command asks PAUP* to calculate the optimality criterion for every possible tree:
    alltrees;

    Did PAUP* allow you to perform an exhaustive search for 35 taxa?

  11. Now try heuristic searching. This approach does not attempt to look at all possible trees, but instead only examines trees that are in the realm of possibility (which can still be a lot of trees!):
    hsearch;

    The search progress will be displayed in a dialog box. When the button says Close rather than Stop, take a look at the numbers summarizing this search. What is the parsimony score of the best tree found during the search? (Write down this score somewhere for later reference.) How many trees were examined (look at # Rearrangements tried)?

  12. Now you probably want to take a look at the tree that PAUP* found and is now holding in memory. First, however, choose an outgroup taxon so that the (unrooted) tree will be drawn in a way that looks like it is rooted in a reasonable place:
    outgroup Funaria_hygrometrica ;
    showtree;

    To make the tree appear to flow downward, which is more pleasing to the eye, tell PAUP* that you would like to use the tree order "right" (this is also commonly known as "ladderizing right"):

    set torder=right;
    showtree;

    Before doing anything else, we should save this tree in a file so that it will be available later, perhaps for viewing or printing in TreeView. Let's call the treefile pars.tre. The brlens keyword in the command below tells PAUP* that you want to save the branch lengths as well as just the tree topology (almost always a good option to include):

    savetrees file=pars.tre brlens;
  13. You may have noticed that PAUP* found 10 most-parsimonious trees. These 10 trees are all indistinguishable using the parsimony criterion. Let's now use the likelihood criterion to evaluate these 10 trees:
    set criterion=likelihood;
    lscores all;

    These commands ask PAUP* to simply evaluate the likelihood score of the trees in memory. Note that because we arrived at these trees using parsimony, it is quite possible that none of these trees represents the maximum likelihood tree. That is, we may be able to find better trees under the likelihood criterion if we performed a search using the likelihood criterion. What is the likelihood score of the best tree? (As for parsimony, write this number down for later comparison.) Is the likelihood score the same for all 10 trees? Which tree is best? Important: PAUP* reports the negative of the natural logarithm of the likelihood score. This means that smaller numbers are better, as smaller numbers represent higher likelihoods.

  14. Next, we will obtain a neighbor-joining tree. Neighbor-joining (NJ for short) is one of the algorithmic methods: that is, it uses an optimality criterion (the minimum evolution criterion) at each step of the algorithm, but in the end produces a tree without actually examining many trees:
    nj;
  15. Let's see how the NJ tree compares to the tree found by parsimony. First, use the lscores command to compute the log-likelihood of the NJ tree:
    lscores all;
    Now compute the parsimony score of the NJ tree using the pscores command:
    pscores all;
    

    According to the parsimony criterion, is the NJ tree better than any of the trees found by parsimony? According to the likelihood criterion, is the NJ tree better than the best tree you have found thus far?

  16. That's all for today in PAUP:
    log stop;
    quit;
    Now we'll look at Mesquite a bit.
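
As a cross-check on the character counts quoted in steps 5 through 7 of the PAUP* exercise above (309 excluded sites, 1406 rbcL sites and 4391 sites in the first three genes), here is a minimal Python sketch of the arithmetic. The function count_sites is hypothetical; the ranges are copied from the exset in step 5 and the charset definitions in step 6.

```python
def count_sites(spec):
    """Count the distinct sites in a list of single sites and first-last ranges."""
    sites = set()
    for token in spec.split():
        first, _, last = token.partition("-")
        # A token without a dash is a single site, so it ends where it starts.
        sites.update(range(int(first), int(last or first) + 1))
    return len(sites)

unused = "1-30 2125-2218 2219-2235 2305-2307 5058-5063 4615-4743 5064-5093"
nad4, rps4, rbcL = "1-2235", "2236-2985", "2986-4391"

print(count_sites(unused))                       # 309 sites in the default exset
print(count_sites(rbcL))                         # 1406 sites re-included by include rbcL
print(count_sites(" ".join([nad4, rps4, rbcL]))) # 4391 sites in the first three genes
```

Using a set means that a site listed twice would be counted only once, so the order in which the ranges are written does not matter.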

Mesquite

Mesquite is a graphical system for the visualization and analysis of comparative data in a phylogenetic context. The home page is mesquiteproject.org

Wayne and David Maddison are the lead authors, but Mesquite was designed to be modular and encourage community input (in fact one of the leading contributors is Dr. Peter Midford, who is currently at KU). Mesquite is free, open-source and written in Java so that it should run on any modern computer. There are some analyses of character data that can only be performed in Mesquite, and a large set of features that are available in other tools such as MacClade. I strongly encourage you to look at the feature list on the Mesquite features page to get a feeling for the types of analyses that Mesquite can perform.

One of Mesquite's great strengths is that it serves as a user-friendly editor for phylogenetic data. The underlying file format that Mesquite uses is the NEXUS format that is used by PAUP* and MrBayes. So you can use Mesquite as a way of getting your data ready for phylogenetic analysis (unfortunately every program tends to speak a slightly different dialect of NEXUS, so some hand-editing is often required).

Because it is a graphical program it is very tedious to describe how to perform actions in Mesquite in detail. Rather than say:
"Go to the Taxa&Trees menu. Choose the New Tree Window. Click on the Default Trees choice. Hit the OK button."

I will use an abbreviated syntax like this:
Taxa&Trees >> New Tree Window >> Default Trees
Note that you cannot tell by looking at that instruction which steps are menu options and which are options in a dialog box. I will also omit the "click OK" step. If you start by looking for the command in the menus of Mesquite and follow through with each step of the instruction, it should be fairly easy to interpret these instructions.

  1. Download Mesquite and launch it by double-clicking on the Mesquite icon.
  2. Copy the file angio35.nex to original-angio35.nex so that if Mesquite modifies the file in an unexpected way we can get back the original data.
  3. Open the file using File >> Open File...
  4. You should see the "project view" of the file, which will have icons for various views of the data. For instance, you can manipulate data about the taxa by clicking on the "Taxa/List & Manage Taxa" button. Look at the character matrix by clicking on the "Show Matrix" button. Familiarize yourself with the matrix editing tools that show up in the toolbar next to the matrix. When you mouse over a button in the toolbar, a description of the tool should appear in the status box at the bottom of the window.
    Note that there are 2 toolbars: one to the left of the matrix and one under the taxa labels.

    Can you add an annotation saying "I don't trust this basecall" to the cell that corresponds to the tenth character and the 11th taxon?

  5. Go back to the Project tab and look at the "List & Manage Characters" view.

    Can you tell which characters were excluded in the exset of the file?

  6. We created some trees in PAUP*, but those were saved to another file. Let's see if we can tell Mesquite that the trees are associated with this data. Use:
    Taxa&Trees >> Import File With Trees >> Link Contents... and then browse to the pars.tre file. This should result in a Trees panel in the project tab. By selecting "Show Trees" you should be able to examine the trees in a new window.
  7. Use Analysis >> Values for Current Tree... >> Treelength to make the tree window display the number of steps required to explain the tree.

    Does the tree length agree with the parsimony score in PAUP? If not, why not?

  8. Is the tree rooted on the outgroup (Funaria hygrometrica)? If not, why not?

    Get familiar with the tools in the tree window's tool bar. Find the rerooting tool and use it to redraw the tree such that Funaria hygrometrica is the outgroup.

    Did the tree length change when you rerooted the tree?

  9. To look at a most parsimonious reconstruction of the character states, we can use Analysis >> Trace Character History >> Parsimony Ancestral Character States . The tree should now be colored (something other than black). The legend in the bottom corner of the window shows how the colors map to the states, what character is being traced, and the number of steps required to explain this character. Find a variable character and look at several rerootings of the tree.

    Does the tree length for the character ever change?

    Find a character with homoplasy in it and use the "Move Branch" tool to rearrange the tree so that there is no homoplasy in that character.

    What happened to the total tree length when you made the changes?

  10. Quit Mesquite and save the changes to the file (you are welcome to continue to check out Mesquite, but we are done with it for today's lab).
  11. Open angio35.nex and look at the changes that Mesquite made.

    Can you find the annotations that you made about the cell of the character matrix? Do you think that these changes will interfere with PAUP reading the file? Were the trees from the parsimony analysis included in the file?

  12. Start PAUP* again. Execute the angio35.nex file and then use PAUP*'s GetTrees command to get the trees from the pars.tre file. Calculate the parsimony scores of the trees (if you changed them in Mesquite, then they will not all have the same score as they did before you ran Mesquite).