####################################################################################
INSTALLING, COMPILING AND CONFIGURING MRBAYES
####################################################################################
*************************
I. Obtaining the program
1. Binary (precompiled) verison
The binary can be installed by downloading the Mac installer from this url:
http://sourceforge.net/projects/mrbayes/files/latest/download?source=files
Double click the installer and follow the prompts.
2. Building from the source code
To compile the binary from source, you will need the gcc compiler. To confirm that
you have it type:
> which gcc
This should result in the directory location of your current copy of gcc, if you
have one installed. If not, install one from the Developer Tools CD that came with
your computer. Most Unix systems will already have gcc installed.
To obtain the source code, open the Terminal application (in the Application/Utilities
directory) and type:
> svn co https://mrbayes.svn.sourceforge.net/svnroot/mrbayes/trunk/src mrbayes
You should now get a number of files downloaded to your directory, in a folder named ”mrbayes”.
Alternatively, you can download the tarball here:
http://sourceforge.net/projects/mrbayes/files/mrbayes/3.2.1/mrbayes-3.2.1.tar.gz/download
Unzip the tarball and move the mrbayes directory to the desired location on your computer.
Change to the mrbayes directory by typing:
> cd mrbayes
Create the Makefile, which contains the instructions for the compiler (the ”make” command), by typing:
> autoconf
Then type:
> ./configure
if you want the serial version, or type:
> ./configure --enable-mpi=yes
if you want the mpr version of the program that can take advantage of multiple processors/
cores.
Note: The MPI-version must be compiled from source and will only run under Unix/Linux operating
systems (inckuding Mac OS).
If you wish to take advantage of the GPU processor for likelihood calculations,
make sure you have (1) an NVIDIA graphics card, and (2) the CUDA driver available
here:
http://www.nvidia.com/object/cuda-mac-driver.html
and the Beagle library avaialbel here:
http://code.google.com/p/beagle-lib/
Note: if you will primarily be analyzing nucleotifde data under conventional
4by4 substitution models, the GPU verison will not confer a speed advantage.
If you don't want to use the Beagle-enabled GPU version, type:
> ./configure --enable-mpi=yes --with-beagle=no
To make sure everything is tidy, thyep
> make clean
and then compile the program by typing:
> make
It will take a moment for the compiler to assemble the binary version of the program.
*************************
II. Starting the program
You can run the program by typing:
> ./mb
We're also going to put the mb binary in our path so that the Terminal knows where
to look for the program. Open the Terminal application and type
> vi .profile
This opens a file called ".profile" in the vi editor. Once ".profile" is open, type
> i
to begin editing (you should see "—Insert—" at the bottom of the Terminal window).
Now simply type
> export PATH="$PATH:/Applications/[wherever your directory is located]/mrbayes_3.2.1/"
to add mrbayes to your path, of course, changing '[wherever your directory is located]'
to to the directory path where your mrbayes directory is located.
When finished, press esc to stop editing and then close
the ".profile" file by typing
> :wq.
Now you should be able to execute the program from any directory by simply typing:
> mb
####################################################################################
ANALYSIS PIPELINE
####################################################################################
I. Overview
This tutorial demonstrates several types of analyses that might be required for
a typical empirical study. Specifically, we will demonstrate how to use MrBayes
to perform the following analyses:
-select among partition schemes (mixed models)
-perform model averaging by means of rjMCMC
-estimate ancestral state for discrete traits
-test for violation of the molecular-clock assumption
-estimate divergence times
**************************************************
1. Selecting among partition schemes (mixed model)
To accommodate process heterogeneity within and/or between various gene(omic)
regions, we will evaluate the support for various various partition schemes
using Bayes factors to compare the marginal likelihoods of the candiate partition
schemes. We will estimate the marginal likelihood for each partition scheme
using both the harmonic-mean (unreliable) and stepping-stone (preferable) estimators.
Open the batch file, conifer_partn.nex, in a text editor. This file contains
all of the commands required to perform the necessary analyses to explore various
partition schemes (unpartitioned, partitioned by gene region, and partitioned by
gene region+codon position. The details of each command are described in
adjacent comments, surrounded in brackets; e.g., [this is a comment].
The relevant results of these analyses are the estimated marginal likelihoods,
summarized in the table below:
Partition Marginal lnL estimates
scheme Harmonic mean Stepping-stone
--------------------------------------------------
uniform -9905.51 -9908.26
moderate -9660.15 -15739.58
extreme -9478.21 -19107.15
--------------------------------------------------
Note that Bayes factors based on comparison of HM-based marginal likelihoods
would *strongly* favor the most extremely partitioned mixed model, whereas
comparison of SS-based marginal likelihoods strongly prefers the simplest,
unpartitioned mixed model. The conflict in the preferred partition scheme
stems from the unreliability of the HM estimator. Never, ever, ever
base Bayes factors on comparison of HM-derived marginal likelihoods.
**************************************************
2. Model averaging by means of rjMCMC
Typically, we first chose a stochastic model of substitution by some means
(hLRT, AIVC, etc.), and then condition our inferences on the chosen model. This
practice ignores uncertainty in the choice of substitution model, which in turn
may bias our inferences. Rather than condition inference on a single model,
we may instead treat the substitution model as a random variable, and integrate
inference of phylogeny over all possible 203 members of the GTR family of substitution
models. We will demonstrate how to do this using reversible-jump MCMC, which
simultaneously provides both a marginal estimate of the phylogeny (topology and branch
lengths) and also the marginal probability of of the individual substitution models.
Open the batch file, conifer_rjMCMC.nex, in a text editor. This file contains
all of the commands required to perform an analysis in which the model is treated
as a random variable. The details of each command are described in adjacent
comments, surrounded in brackets; e.g., [this is a comment].
For the sake of comparison (or just for fun!), let's use a conventional model-selection
method to choose among models. While the rjMCMC analysis is running, we'll use
jModelTest to select a substitution model.
jModelTest returned the following results based on the AIC method:
Model -lnL K AIC delta weight cumWeight
------------------------------------------------------------------------
GTR+G 9887.0969 25 19824.1938 0.0000 0.7759 0.7759
TIM1+G 9890.3440 23 19826.6881 2.4943 0.2229 0.9988
TIM3+G 9895.7602 23 19837.5204 13.3266 0.0010 0.9998
TrN+G 9899.0314 22 19842.0628 17.8690 0.0001 0.9999
These results suggest that the most complex model evaluated, GTR+G, provides
the best fit to the data, and that there is little model uncertainty: only
one model is included in the 95% confidence interval.
The relevant results of the MrBayes analysis are the marginal likelihoods for the
various members of the GTR family of substitution models, summarized in the table below:
Model probabilities above 0.050
Estimates saved to file "conifer_rjmcmc.mstat".
Posterior Standard Min. Max.
Model Probability Deviation Probability Probability
----------------------------------------------------------------------------------
gtrsubmodel[123343] 0.321 0.021 0.306 0.336
gtrsubmodel[123345] 0.175 0.033 0.152 0.198
gtrsubmodel[123454] 0.130 0.011 0.123 0.138
gtrsubmodel[123453] 0.078 0.001 0.077 0.079
gtrsubmodel[123341] 0.062 0.007 0.057 0.067
gtrsubmodel[123141] 0.061 0.001 0.060 0.061
gtrsubmodel[123456] 0.058 0.001 0.057 0.059
----------------------------------------------------------------------------------
First, note that the substitution model with the highest marginal probability is an
unnamed model that differs from that identified by jModelTest, which selected GTR+G.
In fact, the model with the highest marginal probability is not even considered
by standard model-selection methods!
Second, note that there are at least 7 models comprising the 95% credible interval
of models, some models--such as [123345] and [123454]--have substantial marginal
probability.
In such cases, it would not be defensible to condition the analyses on any one
model; it is far preferable in such cases to integrate over all models in order
to accommodate the uncertainty associated with the choice of substitution model.
**************************************************
3. Ancestral-state estimation
Many evolutionary studies involve questions related to the evolution of discrete
phenotypic (morphological, physiological, ecological) traits, particularly the
state of a trait in one or more MRCAs.
Analyses of trait evolution can be performed in MrBayes in a fully hierarchical
manner. That is, we can simultaneously estimate the joint posterior probability
density of the phylogeny and all other model parameters for a mixed data set
containing both molecular and morphological data. We can then look marginally
at the state of a morphological trait (or nucleotide for a site) at a node
of interest. Note that it is best to estimate ancestral states one node at a time.
This analysis requires that we specify normal variables--the data partitions,
the CTMM models for each partition, etc.--but we must also specify (1) the node
that we wish to monitor (representing the MRCA of interest), (2) we must force
the monitored node to be monophyletic, and (3) we must specify which trait(s) we
wish to monitor. We should therefore only evaluate ancestral states for
well supported nodes.
In this example, we will estimate the ancestral state for branching order--whether
spiral or opposite--in the MRCA of conifers, the monophyly of which is well
supported.
Open the nexus file, conifer_anc.nex, in a text editor. This file contains
all of the commands required to perform an analysis in which the model is treated
as a random variable. The details of each command are described in adjacent
comments, surrounded in brackets; e.g., [this is a comment].
95% HPD Interval
--------------------
Parameter Mean Variance Lower Upper Median min ESS* avg ESS PSRF+
---------------------------------------------------------------------------------------------------------
TL{all} 1.469551 0.060504 1.044280 1.925985 1.426342 66.85 71.85 1.046
.
.
.
p(0){1@conifers} 0.474846 0.008713 0.274042 0.502302 0.500000 83.66 95.86 1.002
p(1){1@conifers} 0.525154 0.008713 0.496190 0.720240 0.500000 83.66 95.86 1.002
---------------------------------------------------------------------------------------------------------
These results suggest that there is considerable uncertainty regarding the
ancestral state of branching order (as spiral or opposite) in the MRCA of
conifers: the marginal probabilities for both states are quite similar
(P~0.5)
For these analyses, it may be worthwhile to explore alternative priors and/or
values for parameters used to model the morphological traits. Additionally, it
may be more appropriate to estimate marginal probabilities of ancestral states
from a chronogram (an ultrametric tree where the branch lengths are proportional
to time--the branch lengths in phylograms are rendered as proportional to the
expected number of character changes per trait). Below we demonstrate how to
perform the necessary analyses to estimate divergence times.
**************************************************
4. Testing for violation of the molecular-clock assumption
Many evolutionary questions will require (or at least benefit from) estimates
of divergence times. These inferences initially relied on the (empirically very
shaky) assumption that rates of molecular evolution were approximately constant
through time and across lineages--in other words, estimating divergence times
assumed a strict molecular clock model.
Empirical evidence suggests that the vast majority of data sets do not conform
to a molecular clock--exhibiting substantial variation in rates of substitution
across lineages. The prevalence of this empirical observation has motivated
theoreticians to develop and implement many relaxed-molecular clock model
(which explicitly model how rates vary across lineages), and discouraged
empiricists from bothering to assess whether their data conform to a clock
(I guess this would be a case of a very strong prior!).
Nevertheless, it is a critical first step to test for a clock--simulation studies
suggest that if your data are clock like, a strict clock is your best choice of
model to estimate divergence times (and a relaxed clock is not needed as it
just inflates error variance).
Here we run two types of analyses: one in which the clock is enforced, and another
where it is not. For each type of analysis, we estimate the marginal likelihood
of the clock and non-clock models with both the harmonic-mean and stepping-stone
estimators.
Open the nexus file, conifer_clock.nex, in a text editor. This file contains
all of the commands required to perform the necessary analyses to determine
whether the conifer sequences conform to a molecular clock. The details of each
command are described in adjacent comments, surrounded in brackets; e.g.,
[this is a comment].
The relevant results of the MrBayes analyses are the marginal likelihoods for the
clock and non-clock models estimated by the HM and SS methods, summarized in the
table below:
Branch Marginal lnL estimates
lengths Harmonic mean Stepping-stone
--------------------------------------------------
clock -9675.05 -17824.30
non-clock -9743.27 -21068.95
--------------------------------------------------
In contrast to the analyses performed to select among partiton schemes,
Bayes factors based both on comparison of HM-based marginal likelihoods
and those based on comparison of SS-based marginal likelihoods would
*strongly* favor the molecular-clock hypothesis (a rarity indeed...quick
submit this result to Nature!).
This result suggests that inference of divergence times would preferably
be performed under a strict molecular-clock model. Nevertheless, for the
sake of demonstration (and because the clock is rejected for the vast
majority of empirical data sets) we will proceed below to demonstrate how to
estimate divergence times using a relaxed molecular clock.
**************************************************
5. Estimating divergence times from
Many inference problems will require estimates of absolute or relative divergence
time: trait evolution, biogeography, lineage diversification are just some of
the kinds of problems that benefit from temporal information.
Numerous relaxed-molecular clock models have been proposed, and as is the case for
any model-based inference, we need to carefully assess the relative fit of our
data to the candidate models. Moreover, many of these models involve hyperparameters
that govern the way that substitution rates vary across branches. Being good
Bayesians, it is important to explore the form of the prior for a given model
(e.g., variance in the rate of substitution may be fixed or modeled as an exponential,
uniform or Dirichlet random variable). Moreover, for a given (hyper)prior, we should
explore the parameter value (the mean of the exponential, etc.)
Here we perform several analyses of divergence times under the strict- and
three relaxed-molecular clock models. For each clock model, we will estimate
the marginal likelihood using both the harmonic-mean (HM) and stepping-stone (SS)
estimators.
Open the nexus file, conifer_dte.nex, in a text editor. This file contains
all of the commands required to perform the necessary analyses to estimate
divergence times under various clock models. The details of each
command are described in adjacent comments, surrounded in brackets; e.g.,
[this is a comment].
Clock Marginal lnL estimates
model Harmonic mean Stepping-stone
--------------------------------------------------
strict -9673.11 -14761.50
CPP -9765.29 -21831.11
IGR -9761.98 -19057.93
TK02 -9779.39 -16955.77
--------------------------------------------------
Again, Bayes factors based on comparison of marginal likelihoods (derived from
both the HM and SS estimators) favor the strict clock model, with the autocorrelated
TK02 relaxed-clock model placing second (which makes sense, as it is 'closer' to
the strict clock in model space.
If we were performing these analyses 'for real', however, we would want to
explore the sensitivity of the results to various priors for each of the
clock models; e.g, whether we chose a fixed, exponential, or uniform prior for
variance parameter of the IGR and TK02 models (and for each chosen prior,
the value of the hyperprior), the clock rate prior, the branch-length priors,
etc., etc.