GARLI

(Genetic Algorithm for Rapid Likelihood Inference)


NOTE: Downloading the FigTree tree viewer is recommended before doing this tutorial: http://tree.bio.ed.ac.uk/software/figtree/

getting started
exercise 1: start a basic nucleotide run
exercise 2: monitor an ongoing run
exercise 3: inspect the final results of a run
exercise 4: use gap (insertion-deletion) models
exercise 5: use a partitioned model

getting started

Some general things to know about GARLI

GARLI has no user interface, and is generally a command line program (although it can be double-clicked on Windows)
All configuration information is read from a text file, by default named garli.conf. The program will look for this file in the current directory, but any filename can be specified on the command line.
There are a lot of settings in the configuration file, but very few need to be changed for a basic analysis.
All settings are extensively documented on the GARLI Support Wiki
This is a fairly short tutorial, but for more experience you can look at the full-length GARLI tutorial that is used in the Workshop on Molecular Evolution: complete tutorial.

Doing this tutorial:

Download the package of files used in this activity: garliTutorial.Evo2014.zip
Uncompress the zip file, possibly by double clicking (if necessary, from the OS X or Linux command line, use "unzip garliTutorial.Evo2014.zip")
INDIVIDUAL EXERCISES WILL BE RUN IN SEPARATE FOLDERS, WITH NUMBERS CORRESPONDING TO THE EXERCISE NUMBERS.
YOU WILL NEED TO MAKE EDITS TO TEXT CONFIGURATION FILES FOR MOST EXERCISES.
PLEASE READ THE INSTRUCTIONS AND DON'T GET AHEAD OF YOURSELF! :-)

For the purposes of this tutorial, to start the GARLI program most easily:

Windows: Copy the executable to the directory containing the configuration file. Launch it by double clicking.
OS X: Start the program from the command line, in the directory containing the configuration file. You can either:
- specify the relative path: ../executables/Garli-OSX
- or, copy the executable to the directory containing the configuration file, and execute ./Garli-OSX

The GARLI configuration file

GARLI reads all of its settings from a configuration file. By default it looks for a file named garli.conf, but other configuration files can be specified as a command line argument after the executable (e.g., if the executable were named garli, you could type "garli myconfig.conf"). Note that most of the settings typically do not need to be changed from their default values, at least for exploratory runs.

The config file is divided into three parts: [general], [model] and [master]. The [general] section specifies settings such as the data set to be read, starting conditions and output files. To start running a new data set, the only setting that must be specified is datafname, which names the file containing the data matrix in Nexus, PHYLIP, or FASTA format. The [model] section sets up the evolutionary model of substitution. The [master] section specifies settings that control the functioning of the genetic algorithm itself, and typically the default settings can be used. Basic template configuration files are included with any download of the program.

We will be using a small 29 taxon by 1218 nucleotide mammal data set that was taken from a much larger data set (44 taxa, 17kb over 20 genes) presented in (Murphy 2001). This dataset has been pared down such that little informative data remain, and chosen because it is small but difficult.

exercise 1: start a basic nucleotide run

First open and edit the configuration file:

Change to the 01-03.basicNucleotide directory
Open the garli.conf file in a text editor (you can try double clicking it and if possible, tell your operating system to always open this type of file with an appropriate text editor)

There is not much that needs to be changed in the config file to start a preliminary run. In this case, a number of changes from the defaults have been made so that the example is more instructive and a bit faster (therefore, do NOT use the settings in this file as defaults for any serious analyses). You will still need to change a few things yourself. Note that the configuration file specifies that the program perform two independent search replicates (searchreps = 2). Also note that taxon 1 (Opossum) is set as the outgroup (outgroup = 1).

Make the necessary changes:

Set datafname = murphy29.rag1rag2.nex
Set ofprefix = run1. This will tell the program to begin the name of all output files with "run1...".

Now we need to set the model of sequence evolution for GARLI to use. ModelTest has previously been run on this data set, and the best-fitting model under the AIC criterion is SYM+I+G (6 rates of substitution, equal base frequencies, an estimated proportion of invariant sites, gamma-distributed rate heterogeneity). This is close to the default GTR+I+G model set in the configuration file, so we only need to change the equilibrium base frequency settings.

Set statefrequencies = equal
Save the file
Start GARLI from the command line (OS X) or by double-clicking (Windows) in the 01-03.basicNucleotide directory. It will always look for a garli.conf file in the directory that you launch it from, regardless of where the executable actually is.
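
The configuration edits in this exercise can also be scripted. The following is a minimal Python sketch, not part of GARLI: set_option is a hypothetical helper, and the example text is a cut-down stand-in for the tutorial's garli.conf rather than a copy of it. It assumes only that settings appear as "name = value" lines.

```python
import re


def set_option(conf_text, key, value):
    """Replace the value on every existing `key = value` line.

    Raises KeyError if the key is not present, so a misspelled
    setting name is noticed instead of being silently ignored.
    """
    pattern = re.compile(r"^([ \t]*" + re.escape(key) + r"[ \t]*=).*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + " " + value, conf_text)
    if count == 0:
        raise KeyError(key)
    return new_text


# Hypothetical stand-in for the real file (assumption: only a few lines shown).
example = """[general]
datafname = placeholder.nex
ofprefix = placeholder
searchreps = 2
outgroup = 1

[model]
statefrequencies = estimate
"""

# The three settings changed in this exercise.
edits = {
    "datafname": "murphy29.rag1rag2.nex",
    "ofprefix": "run1",
    "statefrequencies": "equal",
}
for key, value in edits.items():
    example = set_option(example, key, value)

print(example)
```

To apply this to the real file, read garli.conf into a string, pass it through set_option once per setting, and write the result back. A setting name that appears in more than one section would be changed in all of them, which matters for partitioned configuration files.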

You will see a bunch of text scroll by, informing you about the data set and the run that is starting. Most of this information is not very important, but if the program stops be sure to read it for any error messages. The output will also contain information about the progress of the initial optimization phase, and as the program runs it will continue to log information to the screen. This output contains the current generation number, current best log-likelihood score, optimization precision and the generation at which the last change in topology occurred. All of the screen output is also written to a log file that in this case will be named run1.screen.log, so you can come back and look at it later.

exercise 2: monitor an ongoing run

(These are not things that you would normally need to do with your own analyses, but may be instructive here)

Look in the 01-03.basicNucleotide directory, and note the files that have been created by the run.
Open run1.log00.log in a text editor (you may be able to double-click it and choose to open it in an appropriate editor).
This file logs the current best log-likelihood, runtime, and optimization precision over the course of the run. It is useful for plotting the log-likelihood over time. Next, we will look at the file that logs all of the information that is output to the screen.
Open run1.screen.log in a text editor.
This important file contains an exact copy of the screen output of the program. It can be useful when you go back later and want to know what you did. In particular, check the "Model Report" near the start to ensure that the program is using the correct model.
Open run1.best.current.tre in FigTree or another tree viewer and examine the tree. (You may be able to double-click the file and associate .tre files with a tree viewer.)

This optional file contains the current best tree found, which changes over the course of a run. This file (not output by default) is updated every saveevery generations, so it is always easy to see the current best tree during a search. (Do not use this as a stopping criterion and kill the run when you like the tree though!)
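
Since run1.log00.log is described as useful for plotting the log-likelihood over time, here is a hedged Python sketch for pulling the numbers out of it. The column layout is an assumption (generation in the first field, best log-likelihood in the second, whitespace separated); check the header of your own file before relying on it. read_score_log is a hypothetical helper, and the sample lines are invented for illustration.

```python
def read_score_log(lines):
    """Return (generation, lnL) pairs from lines that start with two numbers.

    Assumption: field 1 is the generation and field 2 is the best
    log-likelihood. Header and comment lines are skipped because
    they do not parse as numbers.
    """
    points = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            generation = int(fields[0])
            lnl = float(fields[1])
        except ValueError:
            continue  # not a data line
        points.append((generation, lnl))
    return points


# Invented sample lines (not real GARLI output).
sample = [
    "Search rep 1 (of 2)",
    "gen\tbest_like\ttime\toptPrecision",
    "0\t-14000.5\t1\t0.5",
    "100\t-13500.25\t3\t0.5",
]
print(read_score_log(sample))

# With a real file:
# with open("run1.log00.log") as handle:
#     points = read_score_log(handle)
```

The sketch does not separate search replicates, so if the file contains more than one replicate the generation numbers may restart partway through the list.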

exercise 3: inspect the final results of a run

(These are things that you would examine with your own analyses)

Hopefully at least one of the search replicates has finished by now.

The information that you really want from the program is the best tree found in each search replicate and the globally best tree across all replicates. After each individual replicate finishes, the best trees from all of the replicates completed thus far are written to the run1.best.all.tre file. When all replicates have finished, the best tree across all replicates is written to the run1.best.tre file.
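
As a quick way to see which of the output files named in this tutorial exist so far, a small Python sketch follows. expected_outputs is a hypothetical helper; the suffix list contains only the file names mentioned in this tutorial, and GARLI may write other files as well.

```python
from pathlib import Path

# Output file suffixes mentioned in this tutorial (not an exhaustive list).
SUFFIXES = [
    "screen.log",
    "log00.log",
    "best.current.tre",
    "best.all.tre",
    "best.tre",
]


def expected_outputs(ofprefix, directory="."):
    """Build the paths '<directory>/<ofprefix>.<suffix>' for each known suffix."""
    return [Path(directory) / f"{ofprefix}.{suffix}" for suffix in SUFFIXES]


for path in expected_outputs("run1"):
    status = "found" if path.exists() else "not yet written"
    print(f"{path.name}: {status}")
```

Run it from the 01-03.basicNucleotide directory, or pass that directory as the second argument.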

The config files used here are set up to use a feature of the program that collapses internal branches that have a maximum likelihood estimated length of zero. This may result in final trees that have polytomies. This is generally the behavior that one would want. Note that the likelihoods of the trees will be identical whether or not the branches are collapsed.

When the two search replicates have completed, we can more closely examine the results.

First, take a look at the end of the run1.screen.log file. You will see a report of the scores of the final tree from each search replicate, an indication of whether they are the same topology, and a table comparing the parameter values estimated on each final tree.

There are two possibilities:

The search replicates found the same best tree. You should see essentially identical log-likelihood values and parameter estimates. The screen.log file will note that the trees are identical.
The search replicates found two different trees. This is almost certainly because one or both were trapped in local topological optima. You will notice that the log-likelihood values are somewhat different and the parameter estimates will be similar but not identical. The search settings may influence whether searches are entrapped in local optima, but generally the default settings are appropriate.

We could also evaluate and compare the results of our two search replicates in a program like PAUP*. Being able to open the results of one program in another for further analysis is a good skill to have.

One final note: you could of course visually inspect your final trees in FigTree or another tree viewer, but that is not a quantitative comparison, and it can be hard to tell by eye whether two trees differ.

exercise 4: use gap (insertion-deletion) models

GARLI implements two models appropriate for incorporating gap information into analyses of fixed alignments. These are the "DIMM" model (Dollo Indel Mixture Model) and a variant of the Mkv or "morphology" model. Both decompose an aligned matrix into two matrices: one that contains normal sequence data, and the other that encodes only the gaps as 0/1 characters. The two matrices are then analyzed under a partitioned model. The DNA portion is treated normally, while the 0/1 matrix is treated differently under the two models:

The DIMM model assumes a very strict definition of homology of gap characters, and does not allow multiple insertions within a single alignment column. This implies a "Dollo" assumption, meaning that after a base is deleted another cannot be inserted in the same column. It also means that the model infers rooted trees. The DIMM model extracts very strong signal from the gap information, but also ends up being VERY strongly affected by alignment error.
The Mkv gap variant works on columns of 0/1 characters, but assumes that columns of entirely 0 (fully gap columns) will not be seen. This model DOES allow multiple transitions back and forth between 0 and 1 (gap and base). It does not extract as much information as DIMM, but is also not as sensitive to alignment error.

Preparing the data:

To use the DIMM or Mkv gap models we first need to create a "gap coded" matrix that is just a direct copy of the DNA matrix with gaps as "0" and bases as "1". This is done with an external program called gapcode. Scripts are provided to format nexus or fasta datasets for you, and to start the analyses. For the demo the dataset we'll use is called forTutorial.nex.
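
To make the idea of a gap coded matrix concrete, here is a minimal Python sketch of the 0/1 recoding described above. It is an illustration only and is not the gapcode program's implementation. The function names and the tiny alignment are hypothetical, and passing "?" through unchanged is an assumption about how missing data might be handled.

```python
def gap_code(sequence):
    """Recode one aligned sequence: gap '-' becomes '0', any base becomes '1'.

    Assumption: '?' (missing data) is passed through unchanged.
    """
    coded = []
    for char in sequence:
        if char == "-":
            coded.append("0")
        elif char == "?":
            coded.append("?")
        else:
            coded.append("1")
    return "".join(coded)


def gap_code_matrix(matrix):
    """Apply gap_code to every sequence in a {taxon: sequence} dict."""
    return {taxon: gap_code(seq) for taxon, seq in matrix.items()}


# Hypothetical three-taxon alignment.
alignment = {
    "taxonA": "AC--GT",
    "taxonB": "ACGTGT",
    "taxonC": "A---GT",
}
coded = gap_code_matrix(alignment)
for taxon, row in coded.items():
    print(taxon, row)
```

For real data, use the provided scripts and gapcode, which also write the files in the layout that the later run scripts expect.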

Format data for gap analysis: From the command line in the gapModels or gapModels-Win directory, type "./prepareNexusData.sh forTutorial.nex" or "prepareNexusData.bat forTutorial.nex". NOTE: Even on Windows, this requires command line usage. However, this part is optional for the tutorial (see below).
This will create some alignments in the preparedData directory. (These files are actually already there in the demo package, in case you have difficulty running gapcode.)
You can also use these same scripts to prepare your own nexus or fasta alignment.

Running GARLI gap models:

Now we run the DIMM and Mkv gap analyses by using other provided scripts:

type "./runGarli.dna+gapModels.sh" or on Windows double-click "runGarli.dna+gapModels.bat"
This will do several GARLI searches on the same data. First a quick run will be done to generate starting trees for the other analyses. Then DNA only searches will be run, as well as the DIMM and Mkv gap model searches.
The runs will create output that you can look at in the garliOutput directory. Looking at the XXX.screen.log files will tell you about the details of the run and parameter estimates. The XXX.best.tre files contain the best trees found in each search.
Note that the DIMM model indicates its inferred root by adding a dummy outgroup taxon named ROOT in the tree files. On OS X and Linux the prepareData scripts will create alignment files containing this taxon in the preparedData directory. I haven't managed to automate this on Windows.
If you use the above prepareData scripts on your own dataset, these runGarli scripts should also work properly to automatically analyze your data.
After runs have completed, take a final look at the screen log files and the inferred trees.

exercise 5: use a partitioned model

Partitioned models are those that divide alignment columns into discrete subsets a priori, and then apply independent substitution submodels to each. There are a nearly infinite number of ways that an alignment could be partitioned and have submodels assigned, so not surprisingly configuration of these analyses is more complex.

Note that although some models such as gamma rate heterogeneity allow variation in some aspects of the substitution process across sites, a model in which sites are assigned to categories a priori is more statistically powerful IF the categories represent "real" groupings that show similar evolutionary tendencies.

Running a partitioned analysis requires several steps:

Decide how you want to divide the data up. By gene and/or by codon position are common choices.
Decide on specific substitution submodels that will be applied to each subset of the data.
Specify the divisions of the data (subsets) using a charpartition command in a NEXUS Sets block in the same file as the alignment.
Configure the proper substitution submodels for each data subset.
Run GARLI.
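
As an illustration of step 3, the sketch below prints a NEXUS Sets block that divides an alignment by codon position. Everything specific in it is hypothetical: the charset names, the partition name, the "chunk" subset labels, and the assumption that the 1218 columns are in reading frame starting at column 1. The tutorial's data file already contains its own charsets and sample charpartitions, so use those for the exercise.

```python
def sets_block(charsets, partition_name, subsets):
    """Build the text of a NEXUS Sets block.

    charsets: {name: column specification}, e.g. {"pos1": "1-1218\\3"}
    subsets:  charset names in the order the submodels will be listed
    """
    lines = ["begin sets;"]
    for name, columns in charsets.items():
        lines.append(f"  charset {name} = {columns};")
    parts = ", ".join(
        f"chunk{i}:{name}" for i, name in enumerate(subsets, start=1)
    )
    lines.append(f"  charpartition {partition_name} = {parts};")
    lines.append("end;")
    return "\n".join(lines)


# Hypothetical codon-position charsets (every third column).
charsets = {
    "pos1": "1-1218\\3",
    "pos2": "2-1218\\3",
    "pos3": "3-1218\\3",
}
block = sets_block(charsets, "byPos", ["pos1", "pos2", "pos3"])
print(block)
```

The order of subsets in the charpartition matters, because the submodels in the configuration file are matched to the subsets in that same order.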

Note that detailed instructions and examples are available on this page of the GARLI wiki:
Using partitioned models

On to the actual exercise...

In the 05.basicPartitioned directory, open murphy29.rag1rag2.charpart.nex in a text editor. Scroll down to the bottom of the file, where a NEXUS Sets block with a bunch of comments appears. Notice how the charset commands are used to assign names to groups of alignment columns. Notice the charpartition command, which is what tells GARLI how to make the subsets that it will use in the analysis.
Decide how you will divide up the data for your partitioned analysis. For this exercise it is up to you. There are a few sample charpartitions that appear in the datafile. If you want to use one of those, remove the bracket comments around it. If you are feeling bold, make up some other partitioning scheme and specify it with a charpartition. Save the file.
Now we tell GARLI how to assign submodels to the subsets that you chose. Following is a table of the models chosen by the program Modeltest for each subset of the data. Look up the model for each of the subsets in the partitioning scheme that you chose. Don't worry if you don't know what they mean.

sites      rag1         rag2       rag1+rag2
all        GTR+I+G      K80+I+G    SYM+I+G
1st pos    GTR+G        SYM+G      GTR+I+G
2nd pos    K81uf+I+G    TrN+G      GTR+I+G
3rd pos    TVM+G        K81uf+G    TVM+G
1st+2nd    GTR+I+G      TrN+I+G    TVM+I+G

In the 05.basicPartitioned directory, open the garli.conf file. Everything besides the models should already be set up. Scroll down a bit until you see several sections headed like this: [model1], [model2]. This is where you will enter the model settings for each subset, in normal GARLI model format, in the same order as the subsets were specified in the charpartition. The headings [model1] etc. MUST appear before each model, and MUST begin with model 1. For example, if you created 3 subsets, you'll need three models listed here.
Open the garli_model_specs.txt file. This file will make it much easier to figure out the proper model configuration entries to put into the garli.conf file.
In the garli_model_specs.txt file, find the models that appeared for your chosen subsets in the table above. For example, if I were looking to assign a model to rag2 2nd positions, the model from the table would be "TrN+G". Find the line that reads "#TrN+G" and copy the 6 lines below it. Now paste those into the garli.conf file, right below a bracketed [model#] line with the proper model number.
Start partitioned GARLI.
Peruse the output in the .screen.log file, particularly looking at the parameter estimates and likelihood scores. Note the "Subset rate multiplier" parameters, which assign different mean rates to the various subsets. Note that the likelihood scores of the partitioning scheme that you chose could be compared to the likelihoods of other schemes with the AIC criterion. Details on how to do that appear on the partitioning page of the garli wiki:
Using partitioned models
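
For reference, the AIC comparison mentioned above uses AIC = 2k - 2lnL, where k is the number of free parameters and lnL is the log-likelihood; the scheme with the lower AIC is preferred. The Python sketch below shows the arithmetic only. The log-likelihoods and parameter counts are invented for illustration, and the wiki page describes how to count parameters for GARLI models.

```python
def aic(lnl, num_params):
    """Akaike Information Criterion: 2k - 2lnL (lower is better)."""
    return 2 * num_params - 2 * lnl


# Invented values, for illustration only.
schemes = {
    "unpartitioned": {"lnl": -13320.0, "k": 65},
    "by codon position": {"lnl": -13250.0, "k": 85},
}

scores = {name: aic(s["lnl"], s["k"]) for name, s in schemes.items()}
for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")

best = min(scores, key=scores.get)
print("preferred scheme:", best)
```

Count k the same way for every scheme being compared, so that the extra parameters of a more finely partitioned model are penalized consistently.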

That's it for this tutorial. Feel free to go through the more complete tutorial mentioned earlier. Always feel free to email me (Derrick Zwickl) if you have questions or problems (garli.support{at}gmail.com).
