GeLL Driver

java -jar GeLL.jar settings {var1 var2 var3 ...}

where settings is the name of a settings file. var1 etc. are variables that can be "passed" to the settings file. A $ followed by a number in the settings file will be replaced by the corresponding command line argument. The format of the settings file is described below.
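The substitution described above can be sketched in a few lines. This is only an illustration, not GeLL's own code: the function name `substitute_args` is hypothetical, and the sketch assumes that `$1` refers to var1 (1-based numbering) and that a `$n` with no matching argument is left untouched, neither of which is stated in the text.

```python
import re


def substitute_args(text, variables):
    """Replace each $N in text with the N-th command line variable.

    Assumption: numbering is 1-based ($1 is the first variable after the
    settings file name). References without a matching variable are kept.
    """
    def replace(match):
        index = int(match.group(1))
        if 1 <= index <= len(variables):
            return variables[index - 1]
        return match.group(0)

    return re.sub(r"\$(\d+)", replace, text)


# Hypothetical fragment of a settings file value and two variables.
print(substitute_args("data/$1.phy and $2", ["family7", "out.txt"]))
```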

A run of the driver is controlled by the settings file. The settings file has four optional sections. Each section should begin with the section's name in square brackets, e.g. [Control]. The control section contains general control settings, while the likelihood section controls likelihood optimisation. The ancestral and simulate sections control ancestral reconstruction and simulation respectively. Although each section is optional, a section that is present may have settings that must be set. These are shown by the darker background.

Control section

DebugLevel	The amount of debug information that is displayed when an error occurs. Valid values are: `1 -` Default. Just an error message is logged. `2 -` An error message and stack trace is logged. `3 -` An error message and stack trace is logged along with the message and trace of any underlying exception.
DebugFile	File to log debug information to. If no file is given debug information is printed to screen.
Distributions	How stationary and quasi-stationary distributions are calculated. Valid values are: `Repeat -` Default. Distributions are calculated by repeated application of a P-matrix to a distribution. `Eigen -` Distributions are calculated using Eigendecompositions.
MatrixExponentation	How matrix exponentiations are calculated. Valid values are: `Taylor -` Default. Exponentiations are calculated by a Taylor expansion. `Eigen -` Exponentiations are calculated using Eigendecompositions.
ForceSquare	The minimum number of repeated squaring steps to use when calculating matrix exponentiations using the Taylor method. Defaults to 0.
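To make the `Taylor`, `ForceSquare` and `Repeat` options more concrete, the sketch below shows the general techniques they name: a truncated Taylor series combined with repeated squaring, and a stationary distribution found by applying a P-matrix to a distribution until it stops changing. It is a minimal illustration under stated assumptions, not GeLL's implementation: the function names, the fixed number of series terms, the uniform starting distribution and the tolerance are all hypothetical choices.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_taylor(q, t, squarings=0, terms=30):
    """Approximate exp(q * t) with a truncated Taylor series.

    The matrix is divided by 2**squarings before the series is summed and
    the result is then squared `squarings` times (repeated squaring).
    """
    n = len(q)
    scale = 2 ** squarings
    m = [[q[i][j] * t / scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds m**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def stationary_by_repetition(p, tol=1e-12, max_iter=100000):
    """Apply the P-matrix to a distribution until it no longer changes."""
    n = len(p)
    dist = [1.0 / n] * n  # assumed starting point: uniform distribution
    for _ in range(max_iter):
        new = [sum(dist[i] * p[i][j] for i in range(n)) for j in range(n)]
        if max(abs(new[j] - dist[j]) for j in range(n)) < tol:
            return new
        dist = new
    return dist


# Hypothetical two-state rate matrix, used only for illustration.
q = [[-1.0, 1.0], [2.0, -2.0]]
p = expm_taylor(q, 1.0, squarings=3)
print(stationary_by_repetition(p))
```

Scaling the matrix down before summing the series keeps the individual terms small, which is the usual motivation for forcing a minimum number of squaring steps.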

Likelihood section

AlignmentType	The type of alignment input. See alignment files below for a description of the file formats. Valid values are: `Sequence -` The "alignment" is in sequence format. `Duplication -` The "alignment" is in duplication format.
Alignment	Path to the alignment file.
TreeInput	Path to the input tree file. This file should contain one line containing a tree in Newick format.
Model	Path to the model file. See the Model file description below for format.
ParameterInput	Required unless `Restart` is used. Path to the parameters input file. See the Parameter file description below for format.
Ambig	Path to a file describing any ambiguous states in the alignment.
Missing	Path to an alignment that gives the unobserved data. In the same format as the alignment.
MissingAmbig	Path to a file describing any ambiguous states in the missing alignment.
Optimizer	The optimiser to use. Valid values are: `GoldenSection -` Golden section search. `NelderMead -` Nelder-Mead optimisation.
Checkpoint	File to write checkpoints to. This allows the optimization to be restarted using the `Restart` setting should it be interrupted.
CheckpointFreq	How often (in minutes) the checkpoint file should be written.
Restart	Checkpoint file to restart optimisation from.
TreeOutput	File to output the estimated tree to. If this option is not given then no output is written.
ParameterOutput	File to output the estimated parameters to. If this option is not given then no output is written.
Rescale	Whether to rescale the matrix to one event per time unit. Any value beginning with f is false, all other values are true. Defaults to true.
OptimizeTree	Whether to optimize the tree branch lengths or use those provided. Any value beginning with f is false, all other values are true. Defaults to true.
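The true/false rule used by `Rescale` and `OptimizeTree` has a consequence that is easy to miss: a value such as "no" counts as true, because it does not begin with f. A minimal sketch of the rule follows. The function name `parse_flag` is hypothetical, and treating an upper-case F the same as a lower-case f is an assumption that the text does not confirm.

```python
def parse_flag(value=None):
    """Interpret a Rescale/OptimizeTree style value.

    A missing value gives the default (true). Any value beginning with
    "f" is false and everything else is true.
    Assumption: the comparison ignores case and leading whitespace.
    """
    if value is None:
        return True
    return not value.strip().lower().startswith("f")


for text in ["false", "f", "true", "no"]:
    print(text, parse_flag(text))
```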

Ancestral section

AlignmentType	Required if no Likelihood section. Same meaning as in Likelihood section.
Alignment	Required if no Likelihood section. Same meaning as in Likelihood section.
Tree	Required if no Likelihood section. Same meaning as `TreeInput` in Likelihood section.
Model	Required if no Likelihood section. Same meaning as in Likelihood section.
Parameters	Required if no Likelihood section. Same meaning as `ParameterInput` in Likelihood section.
Type	The type of reconstruction to do. Valid values are: `Joint -` Default. Joint reconstruction. `Marginal -` Marginal reconstruction.
Output	File to write the reconstructed alignment to.

Simulate section

AlignmentType	Required if no Likelihood section. Same meaning as in Likelihood section.
Tree	Required if no Likelihood section. Same meaning as `TreeInput` in Likelihood section.
Model	Required if no Likelihood section. Same meaning as in Likelihood section.
Parameters	Required if no Likelihood section. Same meaning as `ParameterInput` in Likelihood section.
Missing	Path to an alignment that gives the unobserved data. In the same format as the alignment.
Length	The length of the simulated alignment.
Output	File to write the simulated alignment to.

Alignment files

Alignment files can be in one of two different formats:

Sequence - File should be in a format similar to Phylip. The first non-blank line, which normally gives the number and length of the sequences, is ignored. Further non-blank lines represent each sequence, one per line. Anything from the start of the line to the first whitespace is considered the taxon's name. Anything after the first whitespace is the sequence. Whitespace in the sequence is ignored. A taxon name of *class* is assumed not to be a taxon but rather gives the class of each site (which can be any single character).
Duplication - File is tab separated. First row is a header row. First field is ignored while subsequent fields are the names of the species. Each additional row represents a family. The first field is an ID for the family while subsequent fields are the size of the family in the appropriate species. A family name of *class* is assumed not to be a taxon but rather gives the class of each site (which can be any string).
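The two formats above can be read with very little code. The following is a rough sketch rather than GeLL's reader: the function names are hypothetical, family sizes are kept as strings, lines are assumed not to start with whitespace, and the duplication reader makes no attempt to treat *class* specially.

```python
def parse_sequence_alignment(text):
    """Read the Sequence format: header line ignored, one taxon per line."""
    lines = [line for line in text.splitlines() if line.strip()]
    sequences = {}
    classes = None
    for line in lines[1:]:  # the first non-blank line is ignored
        parts = line.split(None, 1)
        name = parts[0]
        # whitespace inside the sequence is ignored
        sequence = "".join(parts[1].split()) if len(parts) > 1 else ""
        if name == "*class*":
            classes = sequence  # one class character per site
        else:
            sequences[name] = sequence
    return sequences, classes


def parse_duplication_alignment(text):
    """Read the Duplication format: tab separated, header row of species."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    species = rows[0][1:]  # the first header field is ignored
    families = {}
    for row in rows[1:]:
        families[row[0]] = dict(zip(species, row[1:]))
    return families


# Hypothetical inputs, used only for illustration.
seqs, site_classes = parse_sequence_alignment(
    "2 4\nhuman AC GT\nmouse ACGA\n*class* 1122\n")
print(seqs, site_classes)
print(parse_duplication_alignment("ID\thuman\tmouse\nfam1\t2\t3\n"))
```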

Parameter file

Each line represents a single parameter. Lines are tab separated. The first field is the type of the parameter and the second is the name of the parameter. Subsequent fields depend on the parameter type. Type values are:

EB - Estimated bounded parameter that is in a rate matrix. 3rd field is the lower bound, 4th the upper.
EP - Estimated positive parameter that is in a rate matrix.
E - Estimated (unbounded) parameter that is in a rate matrix.
F - Fixed parameter. 3rd field is the value.
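As an illustration of the four parameter types listed above, the sketch below turns one tab-separated line into a small dictionary. It is not GeLL's parser: the function name and dictionary keys are hypothetical, and any fields beyond those described in the list are simply ignored.

```python
def parse_parameter_line(line):
    """Parse one tab-separated line describing a single parameter."""
    fields = line.rstrip("\n").split("\t")
    kind, name = fields[0], fields[1]
    if kind == "EB":  # estimated, bounded: 3rd field lower, 4th upper
        return {"type": kind, "name": name,
                "lower": float(fields[2]), "upper": float(fields[3])}
    if kind == "F":  # fixed: 3rd field is the value
        return {"type": kind, "name": name, "value": float(fields[2])}
    if kind in ("EP", "E"):  # estimated positive / estimated unbounded
        return {"type": kind, "name": name}
    raise ValueError("Unknown parameter type: " + kind)


# Hypothetical parameter names, used only for illustration.
for text in ["EB\tp\t0\t1", "EP\tkappa", "E\tshift", "F\tmu\t0.5"]:
    print(parse_parameter_line(text))
```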

Model file

The first line controls the type of model. Possible types and the subsequent format of the rest of the file are:

Gamma distributed rate categories
- First line should start with **G followed by a tab, followed by the number of categories desired. This should be followed by a tab and the name of the parameter that holds the alpha value.
- Second line should contain a file path to the RateCategory file that describes the basic model.
Equally likely rate categories
- First line should contain **E
- Subsequent lines should each contain a file path to a RateCategory file
Given frequency rate categories
- First line should contain **F
- Subsequent lines should each contain an equation describing the frequency of that RateCategory (see Equation Format below for the format of this equation) followed by a tab followed by a file path to a RateCategory file.

To use different models for different site classes the format of this file is different. In this instance each line of the file will represent one class and will contain two fields tab separated. The first field is the class identifier while the second is the file name of a file in the normal model format (above) that defines the model for that class.
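The model file variants described above can be told apart from the first line alone. The sketch below shows one way to do that. It is an illustration only: the function name, the dictionary keys and the file names in the example are hypothetical, blank lines are assumed to be ignorable, and any file whose first field is not **G, **E or **F is assumed to be in the per-class format.

```python
def parse_model(text):
    """Classify a model file by its first line and collect its entries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    if header[0] == "**G":  # gamma distributed rate categories
        return {"type": "gamma",
                "categories": int(header[1]),
                "alpha": header[2],
                "rate_category_file": lines[1].strip()}
    if header[0] == "**E":  # equally likely rate categories
        return {"type": "equal",
                "rate_category_files": [line.strip() for line in lines[1:]]}
    if header[0] == "**F":  # given frequency rate categories
        entries = [tuple(line.split("\t", 1)) for line in lines[1:]]
        return {"type": "frequency", "categories": entries}
    # Otherwise: one line per site class, "class<TAB>model file".
    return {"type": "per_class",
            "models": dict(line.split("\t", 1) for line in lines)}


# Hypothetical file contents, used only for illustration.
print(parse_model("**G\t4\talpha\nbasic.rc\n"))
print(parse_model("1\tslow.model\n2\tfast.model\n"))
```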

RateCategory file

The file format is described below. See Equation Format below for a description of the format of the equations that can be in the rate matrix and root distribution.

First line contains the number of states the RateCategory has
Second line is blank
Third line is a list of states in the order they appear in the rate matrix, tab-separated
Fourth line is blank
Fifth and subsequent lines contain the rate matrix, one row per line. Columns in a row are separated by tabs. Each entry can be an equation.
The rate matrix is followed by a blank line
Finally there is a line giving the base frequencies. Four different values are allowed:
1. Model frequencies - This line contains an equation for the frequency of each state, in the same order as the rate matrix and tab-separated
2. Stationary distribution - Line contains just **S
3. Quasi-stationary distribution - Line contains just **Q
4. FitzJohn distribution - Line contains just **F
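The layout above can be read line by line. The sketch below is a hedged illustration, not GeLL's reader: the function name and dictionary keys are hypothetical, matrix entries and frequency equations are kept as uninterpreted strings, and the example contents (including the placeholder diagonal entries) are invented purely to show the layout.

```python
def parse_rate_category(text):
    """Read a RateCategory file laid out as described above."""
    lines = text.splitlines()
    n_states = int(lines[0].strip())  # first line: number of states
    states = lines[2].split("\t")     # third line: tab-separated states
    matrix = []
    i = 4                             # fifth line: first matrix row
    while i < len(lines) and lines[i].strip():
        matrix.append(lines[i].split("\t"))
        i += 1
    freq_line = lines[i + 1].strip()  # line after the blank line
    special = {"**S": "stationary",
               "**Q": "quasi-stationary",
               "**F": "FitzJohn"}
    if freq_line in special:
        root = special[freq_line]
    else:
        root = freq_line.split("\t")  # one equation per state
    return {"n_states": n_states, "states": states,
            "matrix": matrix, "root": root}


# Hypothetical two-state file; the "0" diagonal entries are placeholders.
example = "\n".join(["2", "", "A\tB", "", "0\ta", "b\t0", "", "**S"])
print(parse_rate_category(example))
```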

Ambiguous states file

File should be a tab-delimited file with one ambiguous character per line. The first field on each line is the ambiguous character while each subsequent field represents a character that could be represented by it.
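A file in this layout maps directly onto a dictionary from each ambiguous character to the characters it may stand for. The sketch below assumes blank lines can be skipped, and the function name is hypothetical. The example uses the standard IUPAC nucleotide codes R (A or G) and N (any base) purely as an illustration.

```python
def parse_ambiguity(text):
    """Map each ambiguous character to the characters it can represent."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        table[fields[0]] = fields[1:]
    return table


print(parse_ambiguity("R\tA\tG\nN\tA\tC\tG\tT\n"))
```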

Equation Format

Variables are represented by a letter followed by any number of alphanumeric characters. Multiplication (represented by *) should be stated explicitly, e.g. a * b NOT a b or ab (the latter of which would be parsed as a single variable). Functions should be represented by f[a,b,...] where f is the function name and a,b etc. are inputs. Inputs cannot contain other functions but can otherwise contain an expression. The following functions are defined:

ln[a] - The natural logarithm
g[a,b,c] - The rate modifier of the bth class of c classes using a gamma distribution with alpha value of a as per Yang 1993.
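The naming rules above explain why `ab` is read as one variable while `a * b` is read as two. The sketch below extracts variable names from an equation to make that visible. It is not GeLL's parser: the function name is hypothetical, the regular expression is one reading of "a letter followed by any number of alphanumeric characters", and a name is treated as a function only when it is immediately followed by `[`.

```python
import re

# A letter followed by alphanumerics, not glued to a preceding number.
NAME = re.compile(r"(?<![A-Za-z0-9.])[A-Za-z][A-Za-z0-9]*")


def variables_in(equation):
    """List the variable names in an equation, in order of first use."""
    names = []
    for match in NAME.finditer(equation):
        if equation[match.end():match.end() + 1] == "[":
            continue  # a name directly followed by "[" is a function
        if match.group(0) not in names:
            names.append(match.group(0))
    return names


print(variables_in("a * b"))
print(variables_in("ab"))
print(variables_in("ln[kappa] * g[alpha,1,4]"))
```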