########
Overview
########

http://phylo.bio.ku.edu/software/sate/sate.html

SATe is a tool for producing trees and alignments from unaligned sequence data.
It iterates between alignment and tree estimation: each iteration creates a new
alignment using a divide-and-conquer strategy guided by the maximum likelihood
(ML) tree obtained in the previous iteration, and then computes a new ML tree
on that alignment.
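
The alternation described above can be pictured with a minimal Python sketch.
It is not SATe's implementation: `realign` and `estimate_tree` are hypothetical
callables standing in for the divide-and-conquer alignment step and the ML tree
search, and the toy stand-ins at the bottom exist only to make the example
runnable.

```python
def iterate_alignment_and_tree(sequences, start_tree, realign, estimate_tree,
                               iterations=3):
    """Alternate between re-alignment and tree estimation.

    realign(sequences, tree) -> alignment      (hypothetical stand-in)
    estimate_tree(alignment) -> (tree, score)  (hypothetical stand-in)
    """
    tree = start_tree
    history = []
    for _ in range(iterations):
        # The tree from the previous round guides the new alignment ...
        alignment = realign(sequences, tree)
        # ... and the new alignment is used to estimate the next tree.
        tree, score = estimate_tree(alignment)
        history.append((alignment, tree, score))
    return history


# Toy stand-ins so the sketch runs; they do no real alignment or tree search.
def toy_realign(sequences, tree):
    width = max(len(seq) for seq in sequences.values())
    return {name: seq.ljust(width, "-") for name, seq in sequences.items()}


def toy_estimate_tree(alignment):
    names = sorted(alignment)
    return "(" + ",".join(names) + ");", -float(len(names))


history = iterate_alignment_and_tree(
    {"a": "ACGT", "b": "ACG", "c": "AC"}, None, toy_realign, toy_estimate_tree)
```

Each entry of `history` holds the alignment, tree, and score produced by one
iteration, which is the information a stopping rule would need.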

The original algorithmic approach is described in:

    Kevin Liu, Sindhu Raghavan, Serita Nelesen, C. Randal Linder, and Tandy
    Warnow. "Rapid and Accurate Large-Scale Coestimation of Sequence Alignments
    and Phylogenetic Trees." Science. 2009. Vol. 324(5934), pp. 1561-1564.
    DOI: 10.1126/science.1171243

The algorithmic approach used in the current software is described in:

    Kevin Liu, Tandy Warnow, Mark T. Holder, Serita Nelesen, Jiaye Yu, Alexis
    Stamatakis, and C. Randal Linder. "SATe-II: Very Fast and Accurate
    Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic
    Trees."  Systematic Biology, 61(1):90-106, 2011.


The SATe software was written by Jiaye Yu, Mark T. Holder, Jeet Sukumaran,
Siavash Mirarab, and Jamie Oaks, and uses the DendroPy library of Sukumaran and
Holder.


#######
Caveats
#######

SATe software is currently available for testing purposes.

Please check your results carefully, and contact us if you have questions,
suggestions or other feedback.

We are aware that the error reporting needs work. If the software announces
that it has finished but has not produced output files, then an error has
occurred.  We are working on having SATe give useful error messages. In the
meantime, please contact us for help if you experience problems running SATe.

Temporary files: SATe uses a .sate directory in your HOME directory to store
temporary results.  In general the GUI tries to clean up after itself, but you
may want to check that location if you think that SATe has been using too much
hard disk space.
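
If you would like to check how much space the temporary directory is using, a
small Python sketch such as the following can help. It assumes only the
`.sate` location described above; the helper name `directory_size_bytes` is
made up for this example.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files below `path` (0 if it does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symbolic links so linked files are not counted.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


if __name__ == "__main__":
    sate_tmp = os.path.join(os.path.expanduser("~"), ".sate")
    size_mb = directory_size_bytes(sate_tmp) / (1024.0 * 1024.0)
    print("%s uses %.1f MB" % (sate_tmp, size_mb))
```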


################################
Graphical user interface version
################################

The graphical user interface (GUI) for SATe gives you access to most of the
available options for running the software. Below are brief descriptions of the
settings that you can control via the GUI.


###################
Starting conditions
###################

If you give SATe a starting tree, it will go directly to the iterative portion
of the algorithm.

If you do NOT give it a starting tree, then SATe will use the specified "Tree
estimator" external tool to infer the initial tree.  This requires an alignment,
which you can provide as input to SATe.  If you do not provide an alignment,
then SATe will use the alignment tool that you have selected to produce an
initial alignment for the entire dataset (this can be slow).

If the initial alignment is very slow, you might want to use the PartTree tool
in MAFFT (http://bioinformatics.oxfordjournals.org/content/23/3/372.abstract) to
estimate a rough starting tree. By providing SATe with the tree estimated by
PartTree, your analysis will bypass the initial alignment/tree-search, and will
immediately begin the first iteration of the SATe algorithm.
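
One possible way to produce such a starting tree is to call MAFFT from Python.
The sketch below assumes that `mafft` is installed and on your PATH, and it
uses MAFFT's `--parttree` and `--treeout` options. The helper names and file
names are placeholders, and the expectation that the guide tree is written to
`<input>.tree` should be verified against your MAFFT version.

```python
import subprocess


def parttree_command(fasta_path):
    """Build the MAFFT command line for a PartTree run that also saves the guide tree."""
    return ["mafft", "--parttree", "--treeout", fasta_path]


def run_parttree(fasta_path, alignment_out):
    """Run MAFFT PartTree; assumes the `mafft` executable is on the PATH."""
    with open(alignment_out, "w") as out:
        # MAFFT writes the alignment to standard output.
        subprocess.run(parttree_command(fasta_path), stdout=out, check=True)
    # Assumption: with --treeout the guide tree is saved next to the input file.
    return fasta_path + ".tree"
```

Before giving the resulting tree to SATe, check that its leaf labels match the
sequence names in your FASTA file, because the guide tree written by MAFFT may
label the leaves differently.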

    Soon, we will implement an option that allows you to specify an aligner for
    the initial alignment operation and a different aligner for the subproblem
    alignment operations. In the meantime, if you want a "quick and dirty"
    alignment for the initial tree searching, you will need to produce this
    alignment yourself and then give it to SATe.


###########################
External Tools (upper left)
###########################

During each iteration SATe breaks down the tree into subproblems, realigns the
data for each subset, merges the alignments into a full alignment, and
re-estimates the tree for the full alignment.

In the external tools section of the application you can choose the software
tools used for each step:

    * "Aligner" is used to select the multiple sequence alignment tool used to
      produce the initial full alignment (this can be slow!), and to align the
      subproblems.

    * "Merger" is used to select the multiple sequence alignment tool used to
      merge the alignments of subproblems into a larger alignment.

    * "Tree Estimator" will allow you to choose the software for tree inference
      from a fixed alignment.
    
    * "Model" allows you to select the substitution model that will be used
      by the tree estimator during tree inference. The options in the drop down
      are contingent on the specified "Tree Estimator".


################################
Sequences and Tree (middle left)
################################

    * If you are running a single locus analysis, leave the "Multi-Locus Data"
      box unchecked. Check this box if you want to run SATe with multiple loci.
      In multi-locus mode, during each iteration SATe aligns each locus
      separately, and then concatenates the alignments for a multi-locus tree
      search.

    * If "Multi-Locus Data" is unchecked, clicking the "Sequence file..." button
      will allow you to select the input sequences in a FASTA-formatted file. If
      "Multi-Locus Data" is checked, clicking the "Sequence file..." button will
      allow you to select the directory where the FASTA files for each locus are
      located.
      
    * NOTE: In multi-locus mode SATe will process ONLY files in the designated
      directory that end in ".fas" or ".fasta", and will treat each as a
      separate locus. All other files and directories will be ignored.
    
    * NOTE: SATe version 2.2.0 or later automatically determines good
      analysis settings based on the size of the dataset(s) read with the
      "Sequence file..." button.  Thus, it is best to READ YOUR DATA FIRST,
      before setting other options, because settings will change when you read
      in your data.  You are still encouraged to explore the settings, but this
      feature provides a good starting point based on the amount of data.

    * Clicking the "Tree file (optional)..." button will allow you to select a
      file with a NEWICK (Phylip) representation of the tree.  If you give SATe
      a starting tree, then it will not align the full dataset before the first
      iteration. Because the initial alignment of the full dataset can be quite
      slow, specifying a starting tree can dramatically reduce the running time.

    * Use the "Data type" drop down menu to specify whether the data should be
      treated as DNA, RNA, or amino acid sequences (because of the 15 IUPAC
      codes for ambiguous states for DNA, it can be difficult to detect the
      datatype with absolute certainty).
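
To preview which files a multi-locus run would pick up from a directory, the
file-name rule noted above can be sketched in Python. This is an illustration
of the rule as described here, not SATe's own code; the function name
`locus_files` is made up, and the case-sensitive match on the extension is an
assumption.

```python
import os


def locus_files(directory):
    """Return the files in `directory` whose names end in .fas or .fasta.

    Subdirectories and all other files are ignored. The match is
    case-sensitive, which is an assumption of this sketch.
    """
    selected = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.endswith((".fas", ".fasta")):
            selected.append(path)
    return selected
```

Calling `locus_files("my_loci")` on a placeholder directory lists the files
that would each be treated as a separate locus under this rule.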


##############################
Workflow Settings (lower left)
##############################

    * Checking the "Two-Phase" option will cause SATe to perform only an
      initial alignment and tree search and then return the results.  It will
      NOT perform the SATe decomposition-merge algorithm.  This is the same as
      running the "Aligner" and "Tree Estimator" on your own.
      
    * Checking the "Extra RAxML Search" post-processing option will cause SATe
      to perform a final RAxML search on the alignment returned by the SATe
      algorithm. This only makes sense if you are using a "Tree Estimator" other
      than RAxML.


##########################
Job Settings (upper right)
##########################

    * "Job Name" allows you to specify the prefix for all files output by SATe.
      Files tagged with this name will appear in the output directory when the
      run completes.
    
    * Clicking the "Output Dir." button will allow you to choose the output
      directory to which to save the alignments and trees returned by SATe. If
      you leave this blank, by default, the results will be written to the same
      directory as the source data file(s).

    * "CPU(s) Available" allows you to specify how many processors should be
      dedicated to the alignment tasks of SATe. If you have a dual-core machine,
      then choosing 2 should decrease the running time of SATe because
      subproblem alignments will be conducted in parallel. In general, for the
      fastest performance, set this equal to the number of processors in your
      machine.

    * "Max. Memory (MB)" lets you specify the size of the Java Virtual Machine
      (JVM) heap to be allocated when running Java tools such as Opal. This
      should be as large as possible. If you get errors when running Java tools,
      one possible reason might be that you have allocated insufficient memory
      to the JVM given the size of your dataset. The default is 1024 MB
      (versions of SATe prior to 2.0.3 had a default of 2048 MB, and did not
      allow it to be changed).
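
If you are unsure how many processors your machine has, Python's standard
library can report the count. The sketch below is only a convenience for
choosing a "CPU(s) Available" value; the function name is made up for this
example.

```python
import os


def suggested_cpu_setting():
    """Return the processor count reported by the OS, or 1 if it is unknown."""
    # os.cpu_count() may return None when the count cannot be determined.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("Suggested CPU(s) Available value: %d" % suggested_cpu_setting())
```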


###########################
SATe Settings (lower right)
###########################

The options in this panel allow you to control the details of the algorithm.

During each iteration, the dataset will be decomposed into non-overlapping
subsets of sequences, and then these subproblems are given to the alignment
tool that you have chosen.
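
The idea of breaking tree edges until every subset is small enough can be
illustrated with a toy Python sketch. It uses a simple "most balanced edge"
rule chosen for illustration, which is not necessarily one of the procedures
that SATe offers, and it represents the tree as a plain list of edges. The
function names and the example tree are made up.

```python
def side_of_cut(edges, start, cut):
    """Return the nodes reachable from `start` without crossing the edge `cut`."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for edge in edges:
            if edge == cut or node not in edge:
                continue
            other = edge[1] if edge[0] == node else edge[0]
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def decompose(edges, leaves, max_size):
    """Break the most balanced edge recursively until each leaf subset has
    at most `max_size` members (max_size should be at least 1)."""
    leaves = set(leaves)
    if len(leaves) <= max_size or not edges:
        return [sorted(leaves)]
    best = None
    for cut in edges:
        side = side_of_cut(edges, cut[0], cut)
        # 0 means the cut splits the leaves into two equal halves.
        imbalance = abs(len(leaves) - 2 * len(leaves & side))
        if best is None or imbalance < best[0]:
            best = (imbalance, cut, side)
    _, cut, side = best
    near = [e for e in edges if e != cut and e[0] in side]
    far = [e for e in edges if e != cut and e[0] not in side]
    return (decompose(near, leaves & side, max_size)
            + decompose(far, leaves - side, max_size))


# A six-leaf example tree; i1..i4 are internal nodes.
edges = [("A", "i1"), ("B", "i1"), ("i1", "i2"), ("C", "i2"), ("i2", "i3"),
         ("D", "i3"), ("i3", "i4"), ("E", "i4"), ("F", "i4")]
subsets = decompose(edges, ["A", "B", "C", "D", "E", "F"], max_size=3)
```

The resulting subsets do not overlap and together contain every sequence, so
each one can be handed to the aligner independently.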

    * The "Max. Subproblem" settings control the largest dataset that will be
      aligned during the iterative part of the algorithm.  Use the "Fraction"
      button and the associated drop-down menu if you would like to express the
      maximum problem size as a percentage of the total number of taxa in the
      full dataset (e.g. 20 for "20%").

    * If you want to express the size cutoff in absolute number of sequences,
      use the "Size" button and its drop-down menu.

    * "Decomposition" allows you to select the procedure used to find the edge
      that should be broken to create subproblems.

    * The "Stopping Rule" section allows you to control how SATe decides that it
      is done. The decision to stop the run can be based on the number of
      iterations ("Iteration Limit" settings) or the amount of time in hours
      ("Time Limit (hr)" settings).

    * If you choose "Blind Mode Enabled", SATe will accept tree/alignment
      proposals even if they do not improve the ML score. If the "blind" mode is
      not in effect, then only pairs with a higher ML score will be accepted.
      
    * The "Apply Stop Rule" drop down allows you to designate when the stopping
      rule should be applied or reset.  When you are running in "blind" mode,
      you can elect to have the stopping rule count the number of iterations
      (or the time) over the entire run ("After Launch"), or you can use a
      termination condition that is based on the progress since the last
      improvement in ML score ("After Last Improvement"). For example, if you
      choose "Blind Mode Enabled", an "Iteration Limit" of 1, and "After Last
      Improvement", then SATe will terminate as soon as it completes a single
      iteration without improving the ML score. The effect is that the SATe
      iterations act like a strictly uphill climber in terms of the ML score.

    * The "Return" drop down allows you to designate whether the tree (and 
      corresponding alignment) returned is the one with the "Best" ML score,
      or the one from the last or "Final" SATe iteration.
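
The interaction of the blind mode, stopping rule, and return settings described
above can be explored with a simplified Python model. It is a sketch of the
behavior as described in this section, not SATe's implementation: it considers
only an iteration limit (no time limit), and the function and parameter names
are made up.

```python
def simulate_run(start_score, proposal_scores, iteration_limit, blind=True,
                 after_last_improvement=False, return_best=True):
    """Return ((score, iteration), iterations_run) for a list of ML scores.

    Iteration 0 denotes the starting tree/alignment pair.
    """
    best = (start_score, 0)
    current = best
    counter = 0
    iterations_run = 0
    for iteration, score in enumerate(proposal_scores, start=1):
        iterations_run = iteration
        improved = score > best[0]
        if improved:
            best = (score, iteration)
        # Blind mode accepts every proposal; otherwise only improvements.
        if blind or improved:
            current = (score, iteration)
        # "After Last Improvement" resets the count when the score improves.
        if improved and after_last_improvement:
            counter = 0
        else:
            counter += 1
        if counter >= iteration_limit:
            break
    return (best if return_best else current), iterations_run


scores = [-90.0, -95.0, -80.0, -85.0]
uphill = simulate_run(-100.0, scores, iteration_limit=1,
                      after_last_improvement=True)
```

In this model the `uphill` run stops after the second iteration, the first one
that fails to improve the score, and returns the result of iteration 1.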


####################
Notes for Developers
####################
Putting:
    SATE_DEVELOPER=1
in your environment will display full stack traces on error exits.

Putting:
    SATE_LOGGING_LEVEL=debug
in your environment will display debug-level log messages.

Putting:
    SATELIB_TESTING_LEVEL=EXHAUSTIVE
in your environment will cause more tests to be executed when you run:
    $ python setup.py test
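
If you prefer to set these variables from a script rather than in your shell,
a sketch like the following builds a suitable environment for a child process.
The variable names come from the notes above; the helper names are made up, and
`run_exhaustive_tests` assumes it is called from the directory that contains
`setup.py`.

```python
import os
import subprocess
import sys


def developer_environment(base=None):
    """Return a copy of the environment with the developer variables set."""
    env = dict(os.environ if base is None else base)
    env["SATE_DEVELOPER"] = "1"                   # full stack traces on error exits
    env["SATE_LOGGING_LEVEL"] = "debug"           # debug-level log messages
    env["SATELIB_TESTING_LEVEL"] = "EXHAUSTIVE"   # run the larger test set
    return env


def run_exhaustive_tests():
    """Run `python setup.py test` with the developer variables in place."""
    return subprocess.call([sys.executable, "setup.py", "test"],
                           env=developer_environment())
```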


####################
Acknowledgments
####################

Code for OptionParsing was taken from Tim Chase's post on:
    http://groups.google.com/group/comp.lang.python/msg/09f28e26af0699b1