Phylogenetics: Mesquite Lab

This lab is adopted from one by Paul Lewis. We will use the versatile program Mesquite to investigate ancestral character states.

Contents

Part A: Simulating data

You will use Mesquite today to simulate data matrices with known properties, then use PAUP* to estimate parameters from the simulated data. Why would we want to simulate data? There are a couple of reasons:

Mesquite requires you to read in a tree with branch lengths and define a substitution model before creating a simulated data matrix. The following tutorial will take you through these steps.

Start Mesquite

The various modules that compose Mesquite are indicated by icons as they are loaded.

Create a tree file

Create a file containing a tree that will serve as the "true tree" for purposes of creating a simulated data set. Copy the following and paste into a new file to create a NEXUS tree file.

#nexus

begin taxa;
 title fake;
 dimensions ntax=5;
 taxlabels A B C D E;
end;

begin trees;
  title truth;
  link taxa = fake;
  tree truetree = (A:0.5,(B:0.05,C:0.05):0.05,(D:0.5,E:0.05):0.05);
 end;  

This file has a couple of nexus commands we have not seen before. The title command simply provides a name to a NEXUS block. The link command is used by Mesquite to connect the information provided in different blocks in various ways. The link and title commands were invented for use by Mesquite, and are not part of the standard set of NEXUS commands, so it is good to be aware that using them may make your NEXUS files unreadable by other programs that read NEXUS files.

Open the file

Read this file into Mesquite using the File > Open File... menu command. You should see a new window appear with the words Taxa "fake" in the title bar.

View the tree

To show the tree that was in this file click on the "View Trees" link in the project panel (use the menu command Taxa&Trees > New tree window and then choose (Stored Trees) when Mesquite asks you for a source of trees for the tree window).

Make branch lengths proportional to values in tree file

The tree that appears is not drawn so that branches are proportional to the lengths we supplied. To fix this, choose Branches Proportional to Lengths from the Drawing menu.

Curvograms and other tree forms

Choose Drawing > Tree Form > Curvogram from the menu. Feel free to play with other tree forms. I often prefer to produce a thinner tree by choosing Drawing > Line width... and change the width to 2.

Define the model that will be used for simulating data

Now you are ready to define a model. Let's create a GTR+I+G model and simulate a data set containing 2,000 sites from it. The first step is to choose Characters > New Character Model > Composite DNA Simulation Model... from the menu above the tree window. Create a name for your new model (e.g. "test model"), then press Ok.

Creating the model

The Edit model dialog box will appear. The different facets (Mesquite calls them submodels) of the model are placed in sections of the dialog box set off by horizontal lines. The first section has to do with base frequencies. If you do nothing, Mesquite will use equal base frequencies both for the root node and the transition probabilities. If you like, you can choose empirical base frequencies for either of these, or you can create your own base frequencies by selecting User-Specified Nucleotide Frequencies... under the drop down list entitled New (there is only one choice in this list, but you must select it to proceed with creating a different base frequency submodel).

You are free to try anything you want here, but note that this tutorial will assume you left both the root states and equilibrium states set to Equal Frequencies.

Specifying rate heterogeneity

The character rates section has to do with among-site rate heterogeneity. Let's put rate heterogeneity into our simulated data using a gamma distribution (but not invariable sites). To do this, we must create a new submodel by selecting Gamma Rates Model... from the drop down list entitled New. Create a name for the new submodel you are creating (e.g. "ASRV submodel"), and then press Ok.

This will bring up the Gamma Rates Model dialog box. Enter a value for the gamma shape parameter (e.g. 0.1) and uncheck the discrete gamma checkbox. By unchecking this checkbox, we are telling Mesquite that we want it to draw a gamma-distributed relative rate separately for each site. Each site will have a different relative rate. If we leave the discrete gamma checkbox checked, Mesquite would draw one of the 4 representative relative rates for each site. In this case, approximately one fourth of the sites would evolve according to the representative rate for the first rate category, one fourth of the sites would evolve according to the representative rate for the second category, and so on. In analyzing data, we always (for practical reasons) approximate the gamma distribution by using the discrete gamma approach, but when simulating data you have more flexibility.

Specifying the rate matrix

Finally, we come to the rate matrix model. Let's use the GTR model for our simulated data. In the drop down list entitled New, choose GTR Rate Matrix Model..., then choose a name (e.g. "GTR submodel") and press the Ok button. You are now presented with the GTR Rate Matrix Model dialog box. Enter a relative rate for each of the 5 substitution classes possible. Mesquite forces the last relative rate (G to T) to be 1, so the other rates you specify will be relative to the GT transversion. I entered, just for fun, 6 for AC, 5 for AG, 4 for AT, 3 for CG and 2 for CT.

Leave the scaling factor set to its default value of 1.0, then press the Ok button to close the Edit model dialog box.

Make it so

You have now specified a tree with branch lengths and a substitution model, so you are ready to simulate. Choose Characters > Make new matrix from > Simulated matrices on Current Tree from the menu attached to the tree window. This will bring up a dialog box asking which kind of model you wish to use. We are simulating DNA data, so choose Evolve DNA Characters and press the Ok button.

Now you are asked to choose a specific DNA character model. You should see the one you defined listed (I called it "test model", but you may have used a different name). Choose the model you defined and press Ok to continue.

Press Ok to confirm use of the current tree for simulating data.

Enter 2000 for the number of characters to simulate, then press Ok.

Enter a name for your simulated data matrix (e.g. "simulated data"), then press Ok.

After a delay, you should see your simulated data matrix appear in a new window. To save this dataset to a nexus file, choose File > Save File As... in the menu attached to your data matrix window. Choose a file name and click Ok.

Analyzing simulated data in PAUP*

Important: do not close Mesquite - you will be simulating more data in a minute, and if you close Mesquite at this point, you will need to start over again from scratch!

Open the file you just saved in PAUP*, set up the GTR+G model using the lset command below

set criterion=likelihood;
lset nst=6 basefreq=equal rates=gamma shape=estimate rmatrix=estimate;

and then use lscores 1 to ask PAUP* to estimate the model parameters.

Save the values reported by the lscores command in a file for later reference. You can copy the values from PAUP*'s output by choosing Edit > Edit Display Buffer from PAUP*'s main menu.

Generating more data

Repeat the exercise, this time creating 10,000 sites rather than only 2,000. This would give PAUP* 5 times more data to work with when estimating parameters. Note that you do not have to start over from scratch: you have already set up the model and tree in Mesquite, you simply need to instructe it to generate more data.

Choose Characters > Make new matrix from > Simulated matrices on Current Tree from the menu attached to the tree window. This will once again bring up a dialog box asking which kind of model you wish to use. Choose Evolve DNA Characters and press the Ok button.

Now you are asked to choose a specific DNA character model. Once again, choose the model you defined and press Ok to continue.

Press Ok to confirm use of the current tree for simulating data.

This time, enter 10000 for the number of characters to simulate, then press Ok.

Enter a different name this time for your simulated data matrix (e.g. "more simulated data"), then press Ok.

After a delay, you should see your simulated data matrix appear in a new window.

Before saving this dataset to a nexus file, it is important to delete the previous data set, which Mesquite still remembers. The problem is that when you save data to a file, Mesquite saves all data it knows about to the file. If you chose File > Save File As... right now, you would end up with a file that contains two data matrices: something that PAUP* will not like. To tell Mesquite to forget about the first (2000-site) data matrix, follow these directions:

Now you can save the file by choosing File > Save File As... from the menu. Follow the directions in the section Analyzing simulated data in PAUP* for analyzing this data set in PAUP*.

Part B: Estimating ancestral states

Create a data set

Create a new nexus data file containing this text:

#NEXUS

begin taxa;
  title fake;
  dimensions ntax=4;
  taxlabels A B C D;
end;

begin trees; 
  title truth;
  link fake;
  tree hw9 = [&U] (A:0.2,B:0.2,(C:0.1,D:0.1):0.1);
end;

begin characters;
  dimensions nchar=2;
  format datatype=standard;
  matrix
    A 01
    B 00
    C 11
    D 11
  ;
end;

Start Mesquite

Exit Mesquite, if you haven't already done so, by closing all windows. Upon closing the last window, you will have to push the Force quit button to get it to actually quit. It is probably not necessary to stop and restart Mesquite, but Mesquite is a complex program, and I tend to start with a clean slate rather than risk having Mesquite remember settings that I have long forgotten about!

Now start it up again and open this newly-created file using the File > Open... menu command. You should see the data matrix appear in a Character Matrix window.

Set up the model

Choose Characters > New character model > Mk1 Model (Markov 1 parameter) from the menu at the top of the Character Matrix window. Choose a name for your model (e.g. "my model").

An Edit model dialog box will appear. Enter 1.0 in the box provided and press the Enter key. The 1 here means that branch lengths will be interpreted in the usual way, as the expected number of substitutions. Mesquite does things a little differently than PAUP*:

Mesquite can correctly equate branch lengths with expected number of substitutions if we specify the substitution rate to equal 1.

Set up the tree window

Go back to the Character Matrix window and choose Taxa&Trees > New Tree Window. In the dialog box that appears, choose Stored Trees and press the Ok button.

From the menu at the top of the Tree Window that appears, choose Drawing > Tree Form > Balls & Sticks, and Drawing > Branches Proportional to Lengths.

Estimating ancestral states using likelihood

From the Analysis menu, choose Trace Character History and choose Stored Characters, Likelihood Ancestral States, Stored Probability Model, and "my model" in the succession of dialog boxes that appear.

From the Trace menu, choose Report Likelihoods As > Raw likelihoods, then hover your mouse cursor over the node at the base of the tree. The value associated with state 0 should be shown as 0.02874, and that for state 1 should be 0.01023. Adding these two values together, you get 0.03897, which is the total likelihood for this character (the log-likelihood is ln(0.03897) = -3.24495). The pie diagram at the node thus shows the proportion of the total likelihood associated with each possible state. The state that contributes most to the likelihood is the estimated ancestral state.

The winning threshhold

Technically, state 0 is the winner here, but Mesquite does not consider there to be a clear cut choice in this case. If Mesquite did consider there to be a clear winner, it would adorn that state with an asterisk to indicate that it is a significantly better choice of ancestral state. Mesquite uses a cutoff to determine whether one state is significantly more likely than another state.

The cutoff is applied to the log-likelihood rather than the likelihood. Switch to showing the (negative) log-likelihoods by choosing Trace > Report Likelihoods As > Negative Log Likelihoods. Now state 0 is seen to have a log-likelihood equal to -3.54934 and state 1 has a log-likelihood equal to -4.58274. The difference between these is 1.0334, which is less than the cutoff (2.0) assumed by Mesquite. You can change the cutoff using Trace > Likelihood Decision Threshhold...

Estimating ancestral states using parsimony

You can see what parsimony would say in this case by choosing Analysis > Trace Character History > Stored Characters > Parsimony Ancestral States > Current Parsimony Models. This will create a new Trace Character pane, which will probably be created right on top of your existing one. Use your mouse to move it off the likelihood Trace Character pane, so you can see the two side by side. Note that, unlike likelihood, parsimony has no qualms about choosing 0 as the ancestral state of the basal node for character 1.

Estimating non-independent evolution between discrete characters

Pagel's (1994) model for detecting correlated evolution between discrete characters using model selection techniques is implement in BayesTraits and Mesquite. We will work through a simple example, to demonstrate how you can conduct this analysis in Mesquite.

The data file

Download the file pagelPrimate.nex. It is a NEXUS version of an example that Pagel demonstrated the method on (and which is distribute with BayesTraits). The characters are oestrus advertisement and multi-male social systems (the theory is that in species in which females mate with multiple males, the females are more likely to evolve displays of their reproductive status). The tree is a single tree taken from a Bayesian MCMC run.

I had to get rid of some missing data cells, before Mesquite would perform the correlation analysis.

Trace characters

If you look at the tree, and then Select "Analysis > Trace Character Over Trees" you can get a sense of the pattern of evolution for these characters (a parsimony reconstruction is fine for this purpose). Do you think that the analysis will find evidence for correlated evolution?

Because the analysis is going to estimate several parameters from a single pair of characters you should have a fair number of species if you want to perform this analysis. The tree must also have branch lengths.

Correlation Analysis

You can select the analysis (assuming that you have Correl module of Mesquite) by choosing "Analysis > Correlation Analysis." The dialog box will give you options for how many repeated times Mesquite will try to optimize the ML score (the likelihood surface for 8 parameters and 2 characters worth of data is often very multimodal and hard do explore using optimization techniques -- so you may have to use more than 10 reps on difficult datasets).

"Check" present P-value box. Rather than rely on Chi-square asymptotics for the critical value for the LRT statistic, Mesquite will perform Monte Carlo simulations under the null hypothesis to calculate a P-value. For your own work, you would want more than 100 reps, but for the sake of brevity you can keep the setting at 100 for this lab.

Conduct the analysis.

Was their enough evidence to reject the four-parameter model that assumes independent evolution (in favor of the eight-parameter model)?

BayesTraits gives you more flexibility on how you would like to constrain the null model (and other options), but Mesquite is a bit easier to use.

Ancestral characte estimation in the face of effects on diversification rates

One assumption of our Markov model of character evolution is that they do not affect the process that causes the branching in the tree. If a character does affect the speciation or extinction rate, then our standard way of calculating the likelihood is not applicable. Intuitively, it makes sense that if a character state is associate with higher extinction rates then we may fail to reconstruct some ancestors as having the state because the species are pruned in a non-random way. The BiSSE model (Maddison, Otto, and Midford) can estimate ancestral character states jointly with diversification rates associated with different character states. The R-package diversitree is a bit more powerful and feature-rich (in particular it is applicable even if you have incomplete sampling). But we'll stick with Mesquite for this lab.

The data file

Download the file bisse.nex. This data file is simply a contrived example designed to show how some of Mesquites diversification features work (it is distributed with Mesquite in the example directory).

Trace characters

Trace the character on the tree. Does there appear to be an effect of the character state on the diversification rate? Which character state seems to be associated with the higher diversification rate?

Because the analysis is going to estimate several parameters from a single pair of characters you should have a fair number of species if you want to perform this analysis. The tree must also have branch lengths.

Character-Associated Diversification analysis

Choose "Character-Associated Diversification" from the Analysis menu. Try out both the "BiSSE Speciation/extinction Likelihood" and the "BiSSE net diversification Likelihood" By scoring repeatedly with different parameters constrained to be equal, you should be able to explore the space of reasonable models.

If you used the AIC, would your explorations of this data set lead you to prefer a model in which the diversification rate is affected by the character state?