Summarizing trees¶
The Centroid class¶
class tetres.summary.Centroid(variation="greedy_omp", n_cores=None, select='random', start='FM', subsample_size=200,
tree_log_file="", max_iterations=None)
This is used to setup a Centroid object which then takes a TimeTreeSet
as input to compute the centroid summary tree.
Variation¶
The variation parameter of a Centroid
has to be one in [“inc_sub”, “greedy”] (TODO: Still WIP).
Variation |
Description |
---|---|
Computes a centroid via the greedy path and neighbourhood search. Only considering the tree with the most imporved SoS value in each iteration. |
|
Starts with a subsample of trees from the set, computes the greedy_omp centroid variant and adds more trees to the subsample until all trees are part of the sample. |
|
Starts with a subsample of trees from the set, computes the greedy_omp centroid variant and then resamples a new subset, using the previous centroid as the starting tree. |
|
Only computes rank move neighbours if the tree contains all common clades of the tree set |
|
Prefers either NNI or Rank moves and switches this if a local optimum is reached |
|
Similar to the incsub variation, only one tree at a time is added to the subsample |
|
Mimicks an online approach where samples arrive one after another and the centroid is computed after each sample starting from the previous centroid |
Greedy¶
tetres.summary.Centroid(variation="greedy_omp") # default, using multiple processes in c!
tetres.summary.Centroid(variation="greedy") # pure python version
Inc_sub¶
The parameter subsample_size defines the size of the subsample of trees that is added each iteration. The parameter max_iterations defines the number of iterations, if it is None the regular break is defined whenever an iteration is not successful at improving the previous centroid. If it is an integer then it defines the number of iterations that will subsample, if it is 0 the start tree will be returned.
tetres.summary.Centroid(variation="inc_sub", subsample_size=500, max_iterations=None)
Iter_sub¶
The parameter subsample_size defines the size of the subsample of trees that is sampled each iteration. The parameter max_iterations defines the number of iterations, if it is None the regular break is defined whenever an iteration is not successful at improving the previous centroid. If it is an integer then it defines the number of iterations that will subsample, if it is 0 the start tree will be returned.
tetres.summary.Centroid(variation="iter_sub", subsample_size=500, max_iterations=None)
Separate¶
Will only use one move, current implementation is for NNI moves only, needs to be switched in source code (_variations.py, line 147).
tetres.summary.Centroid(variation="separate")
Onlyone¶
Will always do one move (starting with rank moves as of current implementation) and switch the move type whenever a local optimum is found.
tetres.summary.Centroid(variation="onlyone")
update-with-one¶
Similar to the inc-sub variation but only one new tree is added in each iteration.
tetres.summary.Centroid(variation="update_with_one")
Online¶
Mimicks an online approach where the trees arrive one by one in the given order.
tetres.summary.Centroid(variation="online")
Selecting a tree¶
This is only the case if multiple trees have the same SoS value. The defualt is random and the options are either random, first or last. The second two options are depending on the ordering which is dictated by the way the neighbourhood is computed.
Starting tree¶
There are the options to start with the last, the first or any given index of tree from the given tree set. The default option however is the sorted Frechet Mean tree (ref), see the doc on FM for more detail.
Subsample size¶
This is used by some variations and can be set to any integer number (default is 200). This number indicates the size of the subsample that the variation will use in its iterations. See the incsub or itersub variations
Maximal iterations¶
This is used to limit the number of iterations the iterative subsampling and increasing subsampling centroid versions are computing. If it is None (default) then the regular break points of those variations apply, otherwise it will only compute upto max_iteration many iterations.
Computing the SoS¶
The n_cores parameters defines the number of cores to use, if -1 all available cores are used (default).
Tree logfile¶
This option will write the trees of each centroid iteration to the given file path. This includes the actual centroid as the last tree. Can be used for further analysis.
Note that for incsub for example the tree is logged after an iteration on the subsample. This results in much smaller log files.
Annotation of a centroid¶
To keep the discrete ranks of a centroid use this annotation method. Each rank get assigned the average height of that rank in the given tree set, guaranteed to keep the same ranked tree after the annotation.
from tetres.summary.annotate_centroid import annotate_centroid
cen, sos = Centroid(variation="greedy").compute_centroid(my_tts) # centroid of the TimeTreeSet my_tts
annotated_cen = annotate_centroid(cen, my_tts) # Annotation with the branch lengths from the TimeTreeSet my_tts
# the annotated_cen is a TimeTree object for further use such as writing the newick to a file
Frechet Mean¶
A version of Sturms algorithm adapted to the RNNI tree space, based on computing shortest paths with the findpath algorithm.
from tetres.summary.frechet_mean import frechet_mean, frechet_mean_sort
fm = frechet_mean(my_tts) # random selection of trees
fm_sort = frechet_mean_sort(my_tts) # trees are sorted from highest to lowest Sum of squared distances
# The idea is that tres with low sum of squared distances are used in the end of the algorithm to refine the tree