The core capability of GeLL is the ability to define and use new models.
There are two main classes involved in defining a model, Model
and RateCategory
. A model consists of one or more
rate categories where a rate category represents a rate matrix and some method of
determining root frequencies.
Here we show how to define models, starting with the simplest model, the Jukes-Cantor model which has no parameters and consists of a single rate category. We then extend this example to the GTR model, which introduces parameters, before introducing multiple rate categories and showing how the GTR+Γ model is defined. Next we define the GTR+Γ+I model which highlights most of the capabilities of GeLL.
We also show how to use the more unusual root distributions supported by GeLL. Finally we look at how we define the parameters used in the models.
To define the Jukes-Cantor model we start by defining the rate matrix as a two-dimensional array of strings. In this simple case every off-diagonal value is set to the text string "1.0" which is simply a numerical rate as a string. Values on the diagonal are ignored by GeLL so here are set to "-" for readability.
Having defined the rate matrix we must now define the root distribution. In this example we are going to define it as part of the model although see "Root distributions" below for other options. When defining it as part of the model we simply need to define a one-dimensional array of strings in a similar way to the rate matrix.
Finally we need to define a map from state (i.e. values in the alignment file) to position in the array.
We now have all the elements necessary to create our model. We first create a rate category and then pass this as the only input to the model constructor.
Defining the rate matrix for models with parameters is only
slightly more complex. To use parameters in the strings used to
define the rate simply use a parameter name, such as a.
Parameters can be combined using standard mathematical
operations. See the MathsParse
java documentation for more on including parameters and the
avaliable operations. The GTR rate matrix is defined as
follows.
The root distribution is also defined similarly:
The map is defined identically to the Jukes-Cantor example above so is not included here.
As the model contains parameters is it also necessary to
create a Parameters
object for use
in calculating likelihoods and optimising the values of the
parameters. For each parameter we create a
Parameter
object and then add it to the list of
parameters. There are a few static methods to create the
Parameter object that put different bounds on the parameters
value. The two we see here limit it to a positive value or
fix it's value. See "Parameters: Defining and constraining"
for more information.
The model is then created in the same way as before.
It should be noted the values in the root distribution do not have to add to one. GeLL will automatically scale the values to make it a valid distribution. This is why the parameters included in the root distribution are only constrained to be positive.
The simplest way to use multiple rate categories is to convert a model, like those, defined above, to one that has rate-across-sites variation using the gamma distribution. GeLL has a built in static function to create a model using multiple gamma-distributed rate categories from a single rate category. To do this you pass the rate category, the name you wish to use for the gamma parameter and the number of categories to create.
We must also add our gamma parameter to our list of parameters.
To use more complex multiple rate categories instead of passing a single rate category to the Model constructor you pass a map from RateCategory to String, where the String defines the frequency of that category using the same format as used in the rate matrix.
As an example we will define the GTR+Γ+I model using
four gamma categories. In this example we assume we have
already defined the rate matrix, root distribution and map as
above and have used these to create a rate category,
r
.
First we create the four gamma categories. We do this by
taking our original GTR category and multiplying it by a value,
in this case the gamma function. For the syntax of the gamma
function see the MathParse
java doc.
We then name the category for nicer output. Next we add the
category to the map and set it's frequency to "(1 - i) / 4",
where i is the proportion of invariant sites.
We then need to create the invariant sites category and add it to the frequency map with the appropriate frequency.
Finally we can create the model.
Our parameters would be the same as for GTR+Γ except we need to add the invariant proportion parameter, which again will be a positive parameter.
For each RateCategory
you can use one of several
methods of defining the root distribution. All of the above
examples use the constructor for RateCategory
that
defines a probability for each state of the root.
There is however a different constructor that allows three different methods of defining the root distribution. This constructor is:
The three possible values of FrequencyType
allow
three different ways of calculating the root frequency, namely:
FrequencyType.STATIONARY
- This sets the root frequency to the stationary
distribution of the rate matrix.
FrequencyType.QSTAT
- This sets the root frequency to the quasi-stationary
distribution of the rate matrix. The first state is assumed
to be the sink state in the quasi-stationary distribution and
the frequency of this state is set to zero.
FrequencyType.FITZJOHN
- This uses the method of FitzJohn et al (2009) to
calculate the root distribution.
So far we have seen two ways of creating, and constraining, parameters, namely:
There are two other ways to create a parameter. One for an completely unbounded parameter (i.e. it can take negative values) and one for bounded parameters where you provide a lower and an upper bound:
For each method that creates an estimated parameter there is a similar method that created the parameter but with a different starting value. This can be useful if, for example, optimisation is taking a long time from the default start value.