Models and Parameters

The core capability of GeLL is the ability to define and use new models. There are two main classes involved in defining a model, Model and RateCategory. A model consists of one or more rate categories where a rate category represents a rate matrix and some method of determining root frequencies.

Here we show how to define models, starting with the simplest model, the Jukes-Cantor model which has no parameters and consists of a single rate category. We then extend this example to the GTR model, which introduces parameters, before introducing multiple rate categories and showing how the GTR+Γ model is defined. Next we define the GTR+Γ+I model which highlights most of the capabilities of GeLL.

We also show how to use the more unusual root distributions supported by GeLL. Finally we look at how we define the parameters used in the models.

Jukes-Cantor: A very simple model Show

To define the Jukes-Cantor model we start by defining the rate matrix as a two-dimensional array of strings. In this simple case every off-diagonal value is set to the text string "1.0" which is simply a numerical rate as a string. Values on the diagonal are ignored by GeLL so here are set to "-" for readability.

String[][] ma = new String[4][4];
ma[0][0] = "-"; ma[0][1] = "1.0"; ma[0][2] = "1.0"; ma[0][3] = "1.0";
ma[1][0] = "1.0"; ma[1][1] = "-"; ma[1][2] = "1.0"; ma[1][3] = "1.0";
ma[2][0] = "1.0"; ma[2][1] = "1.0"; ma[2][2] = "-"; ma[2][3] = "1.0";
ma[3][0] = "1.0"; ma[3][1] = "1.0"; ma[3][2] = "1.0"; ma[3][3] = "-";

Having defined the rate matrix we must now define the root distribution. In this example we are going to define it as part of the model although see "Root distributions" below for other options. When defining it as part of the model we simply need to define a one-dimensional array of strings in a similar way to the rate matrix.

String[] freq = {"0.25", "0.25", "0.25", "0.25"};

Finally we need to define a map from state (i.e. values in the alignment file) to position in the array.

HashMap<String,Integer> map = new HashMap<>();
map.put("T",0);
map.put("C",1);
map.put("A",2);
map.put("G",3);

We now have all the elements necessary to create our model. We first create a rate category and then pass this as the only input to the model constructor.

return new Model(new RateCategory(ma,freq,map));

GTR: Introducing parameters Show

Defining the rate matrix for models with parameters is only slightly more complex. To use parameters in the strings used to define the rate simply use a parameter name, such as a. Parameters can be combined using standard mathematical operations. See the MathsParse java documentation for more on including parameters and the avaliable operations. The GTR rate matrix is defined as follows.

String[][] ma = new String[4][4];
ma[0][0] = "-"; ma[0][1] = "a*pC"; ma[0][2] = "b*pA"; ma[0][3] = "c*pG";
ma[1][0] = "a*pT"; ma[1][1] = "-"; ma[1][2] = "d*pA"; ma[1][3] = "e*pG";
ma[2][0] = "b*pT"; ma[2][1] = "d*pC"; ma[2][2] = "-"; ma[2][3] = "pG";
ma[3][0] = "c*pT"; ma[3][1] = "e*pC"; ma[3][2] = "pA"; ma[3][3] = "-";

The root distribution is also defined similarly:

String[] freq = {"pT", "pC", "pA", "pG"};

The map is defined identically to the Jukes-Cantor example above so is not included here.

As the model contains parameters is it also necessary to create a Parameters object for use in calculating likelihoods and optimising the values of the parameters. For each parameter we create a Parameter object and then add it to the list of parameters. There are a few static methods to create the Parameter object that put different bounds on the parameters value. The two we see here limit it to a positive value or fix it's value. See "Parameters: Defining and constraining" for more information.

Parameters p = new Parameters();
p.addParameter(Parameter.newEstimatedPositiveParameter("a"));
p.addParameter(Parameter.newEstimatedPositiveParameter("b"));
p.addParameter(Parameter.newEstimatedPositiveParameter("c"));
p.addParameter(Parameter.newEstimatedPositiveParameter("d"));
p.addParameter(Parameter.newEstimatedPositiveParameter("e"));

p.addParameter(Parameter.newFixedParameter("pT",1.0));
p.addParameter(Parameter.newEstimatedPositiveParameter("pC"));
p.addParameter(Parameter.newEstimatedPositiveParameter("pA"));
p.addParameter(Parameter.newEstimatedPositiveParameter("pG"));

The model is then created in the same way as before.

It should be noted the values in the root distribution do not have to add to one. GeLL will automatically scale the values to make it a valid distribution. This is why the parameters included in the root distribution are only constrained to be positive.

GTR+Γ: Using multiple rate categories Show

The simplest way to use multiple rate categories is to convert a model, like those, defined above, to one that has rate-across-sites variation using the gamma distribution. GeLL has a built in static function to create a model using multiple gamma-distributed rate categories from a single rate category. To do this you pass the rate category, the name you wish to use for the gamma parameter and the number of categories to create.

return Model.gammaRates(new RateCategory(ma,freq,map),"g",4);

We must also add our gamma parameter to our list of parameters.

p.addParameter(Parameter.newEstimatedPositiveParameter("g"));

GTR+Γ+I: More complex rate categories Show

To use more complex multiple rate categories instead of passing a single rate category to the Model constructor you pass a map from RateCategory to String, where the String defines the frequency of that category using the same format as used in the rate matrix.

As an example we will define the GTR+Γ+I model using four gamma categories. In this example we assume we have already defined the rate matrix, root distribution and map as above and have used these to create a rate category, r.

First we create the four gamma categories. We do this by taking our original GTR category and multiplying it by a value, in this case the gamma function. For the syntax of the gamma function see the MathParse java doc. We then name the category for nicer output. Next we add the category to the map and set it's frequency to "(1 - i) / 4", where i is the proportion of invariant sites.

HashMap<RateCategory,String> freq = new HashMap<>();

for (int i = 1; i <= 4; i++)
{

RateCategory nr = r.multiplyBy("g[g," + i + ",4]");
nr.setName("Gamma Category " + i);
freq.put(nr, "(1 - i) / 4");

}

We then need to create the invariant sites category and add it to the frequency map with the appropriate frequency.

String[][] ia = new String[4][4];
ia[0][0] = "-"; ia[0][1] = "0"; ia[0][2] = "0"; ia[0][3] = "0";
ia[1][0] = "0"; ia[1][1] = "-"; ia[1][2] = "0"; ia[1][3] = "0";
ia[2][0] = "0"; ia[2][1] = "0"; ia[2][2] = "-"; ia[2][3] = "0";
ia[3][0] = "0"; ia[3][1] = "0"; ia[3][2] = "0"; ia[3][3] = "-";

String[] ifreq = {"pT", "pC", "pA", "pG"};

RateCategory ir = new RateCategory(ia,ifreq,map);
freq.put(ia,"i");

Finally we can create the model.

return new Model(freq);

Our parameters would be the same as for GTR+Γ except we need to add the invariant proportion parameter, which again will be a positive parameter.

p.addParameter(Parameter.newEstimatedPositiveParameter("i"));

Root distributions Show

For each RateCategory you can use one of several methods of defining the root distribution. All of the above examples use the constructor for RateCategory that defines a probability for each state of the root.

There is however a different constructor that allows three different methods of defining the root distribution. This constructor is:

public RateCategory(String[][] rates, FrequencyType freqType, HashMap<String, Integer> map)

The three possible values of FrequencyType allow three different ways of calculating the root frequency, namely:

FrequencyType.STATIONARY - This sets the root frequency to the stationary distribution of the rate matrix.

FrequencyType.QSTAT - This sets the root frequency to the quasi-stationary distribution of the rate matrix. The first state is assumed to be the sink state in the quasi-stationary distribution and the frequency of this state is set to zero.

FrequencyType.FITZJOHN - This uses the method of FitzJohn et al (2009) to calculate the root distribution.

Parameters: Defining and constraining Show

So far we have seen two ways of creating, and constraining, parameters, namely:

Parameter.newEstimatedPositiveParameter(String name)
Parameter.newFixedParameter(String name, Double value))

There are two other ways to create a parameter. One for an completely unbounded parameter (i.e. it can take negative values) and one for bounded parameters where you provide a lower and an upper bound:

Parameter.newEstimatedParameter(String name)
Parameter.newEstimatedBoundedParameter(String name, double lowerBound, double upperBound)

For each method that creates an estimated parameter there is a similar method that created the parameter but with a different starting value. This can be useful if, for example, optimisation is taking a long time from the default start value.

Parameter.newEstimatedPositiveParameter(String name, double start)
Parameter.newEstimatedParameter(String name, double start)
Parameter.newEstimatedBoundedParameter(..., double start)