Skip Navigation

Systematic Biology 2007 56(5):727-740; doi:10.1080/10635150701611134
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Whelan, S.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Whelan, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 Society of Systematic Biologists

New Approaches to Phylogenetic Tree Search and Their Application to Large Numbers of Protein Alignments

Edited by Thomas Buckley

Simon Whelan1,2

1 University of Manchester, Faculty of Life Sciences, Michael Smith Building Oxford Road, Manchester, M13 9PT, UK E-mail: simon.whelan{at}manchester.ac.uk
2 EMBL—European Bioinformatics Institute, Wellcome Trust Genome Campus Hinxton, Cambridge, CB10 1SD, UK


    Abstract
 Top
 Abstract
 Tree Estimation Heuristics
 Algorithms and Methods
 Results
 Conclusions
 Appendix 1
 References
 
Phylogenetic tree estimation plays a critical role in a wide variety of molecular studies, including molecular systematics, phylogenetics, and comparative genomics. Finding the optimal tree relating a set of sequences using score-based (optimality criterion) methods, such as maximum likelihood and maximum parsimony, may require all possible trees to be considered, which is not feasible even for modest numbers of sequences. In practice, trees are estimated using heuristics that represent a trade-off between topological accuracy and speed. I present a series of novel algorithms suitable for score-based phylogenetic tree reconstruction that demonstrably improve the accuracy of tree estimates while maintaining high computational speeds. The heuristics function by allowing the efficient exploration of large numbers of trees through novel hill-climbing and resampling strategies. These heuristics, and other computational approximations, are implemented for maximum likelihood estimation of trees in the program Leaphy, and its performance is compared to other popular phylogenetic programs. Trees are estimated from 4059 different protein alignments using a selection of phylogenetic programs and the likelihoods of the tree estimates are compared. Trees estimated using Leaphy are found to have equal to or better likelihoods than trees estimated using other phylogenetic programs in 4004 (98.6%) families and provide a unique best tree that no other program found in 1102 (27.1%) families. The improvement is particularly marked for larger families (80 to 100 sequences), where Leaphy finds a unique best tree in 81.7% of families.

Keywords: Algorithms; evolution; phylogenetic tree inference; tree estimation heuristics

Received July 12, 2006; Revised October 17, 2006; Accepted April 6, 2007


Inferring the evolutionary relationship within a set of homologous sequences is fundamental to a wide range of biological sciences (e.g., Felsenstein, 2004; Nielsen, 2005). In evolutionary biology the topology of the tree is often of primary interest and has proved edifying in many problems, from demonstrating the zoonoses of immunodeficiency viruses from chimpanzees to humans (e.g., Hahn et al., 2000), to insights into the tree of life (e.g., Ciccarelli et al., 2006). In other disciplines the tree estimate is the foundation upon which other results are built. In comparative genomics, for example, accurate tree estimates are required for the identification of conserved elements (e.g., Siepel et al., 2005), whereas in structural biology evolutionary similarity may be taken to be analogous to functional similarity (e.g., Thornton and Orengo, 2005). Consequently, there has been much research into developing and improving the methodology used to infer trees (see Swofford et al., 1996; Felsenstein, 2004).

Approaches to estimating trees differ in the way they use the information held in sequences and how they draw inferences from it. This study treats phylogenetic tree estimation as a problem of statistical inference (Yang et al., 1995) and aims to develop heuristics that are effective at finding the best (optimal) tree for describing the evolution of a set of aligned sequences under standard maximum likelihood (ML) approaches. Rogers (1997) showed that ML was statistically consistent, which means that the optimal tree will tend towards the true tree as the length of sequences examined increases, conditional on the accuracy of the evolutionary model used. Typical phylogenetic studies may not recover the true tree because only finite data are available for inference. Instead, it is advisable to construct a confidence set of trees around the globally optimal tree within which the true tree would be expected to fall with a controllable degree of error (see Goldman et al., 2000; Kosiol et al., 2006), although calculating confidence intervals is beyond the scope of this manuscript.

Finding the globally optimal tree(s) relating a set of aligned sequences requires assessing the likelihood of all possible trees (a set referred to as tree space) and choosing the best. The size of tree space grows rapidly with the number of sequences examined; for 50 sequences there are approximately 1076 rooted trees, a number comparable to the estimated number of atoms in the observable universe. Branch-and-bound algorithms can reduce the number of trees that need to be examined to find the global optima, but in typical circumstances still require the computation of an unfeasibly large number of trees. This necessitates the use of tree estimation heuristics (TEHs) for searching tree space, which are not guaranteed to find the globally optimal tree but are hoped to perform relatively well with respect to recovering either the true tree or features therein (e.g., bipartitions in the tree). The majority of TEHs can be considered as a partial or complete subset of the four-step process shown in Figure 1 (see Tree Estimation Heuristics below).


Figure 1
View larger version (63K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 Schematic description of tree estimation heuristics.

 
This paper introduces a set of TEHs with several novel features. I introduce the SNAP family of refinement schemes that substantially increases the number of trees considered during refinement (see Fig. 1) relative to the popular nearest-neighbor interchange scheme (Swofford et al., 1996). This is expected to reduce the number of local optima while maintaining a computational complexity linear to the number of sequences. I also introduce and implement the partial stepwise addition (PSA) resampling scheme, based upon an approximation to stepwise addition (Swofford et al., 1996), which demonstrably improves coverage of sequence space. Finally, I introduce an application of tabu searching (e.g., Glover et al., 1993; Charleston, 2001) to maximum likelihood phylogenetic search. These ideas are implemented in the software Leaphy (Likelihood estimation algorithms for phylogenetics) and their performance, along with a series of further approximations that significantly improve computational performance, is systematically examined. Implementations of these heuristics are compared with leading phylogenetic programs by estimating trees from 4059 protein sequence alignments. The results demonstrate that Leaphy performs better than other comparable programs, with the caveat that for large phylogenies it remains valuable to use multiple programs for phylogenetic tree estimation.


    Tree Estimation Heuristics
 Top
 Abstract
 Tree Estimation Heuristics
 Algorithms and Methods
 Results
 Conclusions
 Appendix 1
 References
 
The types of heuristics used to obtain point estimates of phylogenetic trees depend on the methodology used. Those based only on pairwise distances usually take the output of a clustering algorithm as their tree estimate, whereas score-based (optimality criterion) methods, such as maximum likelihood and maximum parsimony, often have quite sophisticated multistep heuristics. The majority of existing TEHs can be reduced to the four components shown in Figure 1, although not all the steps are employed by all phylogenetic programs. Below, I briefly describe the purpose of each component of the heuristic and discuss the benefits and disadvantages of existing algorithms.

Initial Tree Proposal
Description
The initial proposed tree is the starting point that TEHs work from. For programs that use only a clustering approach based on pairwise distances, such as neighbor-joining (Saitou and Nei, 1987) and its derivatives (e.g., Gascuel, 1997; Bruno et al., 2000), this is the only step required (see Fig. 1). For other programs using score-based methodology, this initial step can be critical to the overall performance of a TEH. If the proposal mechanism consistently gets close to the optimal tree, the whole heuristic will probably perform well; if the proposal mechanism frequently produces trees very different from the optimal tree, such as a random tree proposal may do, then the TEH will suffer.

Current algorithms
Initial tree proposal usually consists of a relatively simple and fast clustering method based on pairwise distances or hierarchical clustering using the full scoring method, such as stepwise addition and star decomposition (Swofford et al., 1996; Felsenstein, 2004).

Tree Refinement
Description
The purpose of tree refinement is to move around tree space looking for "better" trees, which are those that give a higher score according to whatever criterion (e.g., likelihood or parsimony) is being used. Refinement frequently consists of a deterministic greedy hill-climbing algorithm that iteratively optimizes the score of the tree and is the usual topic when TEHs are discussed in the phylogenetic literature. In each iteration, a series of rearrangements to the tree are made to produce a list of candidate trees, referred to as the neighborhood of the refinement scheme. The score for each tree in the neighborhood is calculated and the tree that maximizes the score is chosen as a starting point for the next iteration. The algorithm terminates when the top of the hill is reached, which occurs when no better score can be found in the neighborhood. There is, unfortunately, no way to decide whether this hilltop is the globally optimal tree or one of the multiple local optima that typically feature in the landscape of tree space. An effective refinement scheme needs to be able to move rapidly through tree space to identify good trees quickly and reduce the frequency of local optima but also remain near to linear in computation time with the number of sequences, making them practical for the large data sets that evolutionary biologists typically want to examine. Few widely used refinement schemes satisfy both these requirements.

Other refinement schemes introduce a random element to the hill climbing approach, with popular versions including genetic algorithms (Lewis, 1998; Zwickl, 2006) and simulated annealing (Lundy, 1985; Salter and Pearl, 2001).

Current algorithms
The most popular refinement scheme is probably NNI (e.g., Swofford et al., 1996). This breaks each branch of the tree in turn and examines the three tree topologies that can be formed by rejoining the four subtrees, one topology of which, of course, is the original. An important property of NNI is that the number of trees in the neighborhood per iteration increases linearly with the number of leaves of the tree, making NNI practical for very large trees. For n species there are n – 3 branches in an unrooted tree; each of these branches when broken proposes two new trees, resulting in 2n – 6 trees in the neighborhood. The relatively small neighborhood size of NNI results in small steps in tree space and a tendency for large numbers of local optima.

Two expanded versions of NNI are also widely used: Subtree Pruning and Regrafting (SPR) and Tree Bisection and Reconnection (TBR). In common with NNI, they examine the trees produced by breaking each branch in turn but use more expansive rearrangements to vary the way the subtrees are put back together. With SPR, the neighborhood is produced by examining all of the ways the broken branch of one subtree can be regrafted onto the other. The neighborhood of TBR is larger still and is produced by examining all the possible ways that branches in one subtree can be connected to the other (for further details of SPR and TBR see Swofford et al., 1996). With SPR and TBR, the neighborhoods grow at a greater than linear rate with the number of sequences in the tree, with faster growth under TBR than SPR. This rapid increase in neighborhood size makes both schemes extremely computationally time consuming for large datasets, although due to a lack of viable alternatives they remain frequently used. Some programs have also introduced a bound on how far SPR can move a subtree, controlling the relationship between the size of a neighborhood and the number of species in a tree (e.g., Stamatakis et al., 2005).

Other refinement schemes exist, such as edge-contract-and-refine (Sankoff et al., 1994; Ganapathy et al., 2004) and disk-covering methods (Huson et al., 1999), but are rarely used.

Resampling
Description
The aim of resampling is to increase the coverage of tree space by moving the TEH to different and potentially interesting regions in an attempt to move away from local optima. Typically, a resampling step consists of a move in tree space that has the potential to radically alter the current tree topology. Depending on the scheme used, this may reduce the score of the current estimate with the hope that subsequent rounds of refinement will find an even better tree. A good resampling scheme will sample highly scoring trees (regions of tree-space) more often than poorly scoring trees (regions) and allow all trees to be visited with a non-zero probability. Resampling may be particularly effective when there are multiple optima, where a random element in a resampling scheme combined with deterministic refinement may allow the TEH to visit many good optima in potentially quite different regions of tree-space.

Current algorithms
Resampling tree space is a relatively underused aspect of tree inference and not frequently discussed in the phylogenetic literature. It has the potential to significantly improve the quality of tree estimates and the idea was present in some of the original phylogenetic programs. For example, it has been common practice when using dnaml to run the stepwise addition algorithm multiple times with different input orders to produce different starting points for the refinement step (Felsenstein, 2005). More recent innovations include important quartet puzzling, which plucks a number of sequences from a tree and reintroduces them in a random order using an approximation to a full likelihood approach (Vinh and von Haeseler, 2004), and ratchet approaches, which use bootstrap resampling of the data to explore new regions of tree space (Nixon, 1999; Vos, 2003).

Related algorithms
A common alternative approach to this purist random resampling is to combine two or more deterministic refinement heuristics, the quicker of which (e.g., NNI) is used for the refinement step, and an alternative with a slower more expansive scheme (e.g., TBR) being used rarely to attempt to make larger steps in tree space. Despite their large neighborhoods, these combined deterministic approaches may remain prone to local optima because they still only examine a tiny proportion of tree space.

Stopping
The stopping rule is required to ensure the algorithm terminates. The simplest rule is to terminate when no improvement in score can be found, which is used when there is no resampling or when a resampling step using a deterministic approach (e.g., SPR or TBR; see above) cannot find an improvement in likelihood, an approach used by the majority of phylogenetic tree estimation programs. In more sophisticated approaches the termination rule is often arbitrary, such as a prespecified number of resampling steps, although more elaborate rules have been suggested that try and predict when an algorithm will finish (e.g., Vinh and von Haeseler, 2004). After terminating, the best scoring tree identified during the whole TEH is returned as the heuristics estimate of the optimal tree.


    Algorithms and Methods
 Top
 Abstract
 Tree Estimation Heuristics
 Algorithms and Methods
 Results
 Conclusions
 Appendix 1
 References
 
Refinement: Symmetric Neighborhood Alterations to Phylogenies (SNAP)
The first of the developments to TEHs I introduce in this study is SNAP, which is a family of refinement schemes that significantly increases the size of the neighborhood while maintaining a linear relationship between computational speed and the number of sequences. In contrast to the majority of commonly used approaches that focus on breaking each branch of a tree in turn and making rearrangements to the resultant subtrees, SNAP produces its neighborhood of candidate trees, T, by systematically removing regions of the tree to produce a set of subtrees and listing all of the different ways of putting them back together again. To define the regions to be removed in each refinement step, I select a centering point (CP) and a radius. Two forms of CP can be used: node CPs (Fig. 2a) or branch CPs (Fig. 2b). The radius describes the area around the CP from which the tree is removed and two types of radius are demonstrated in Figure 2: one based on the number of edges away from the CP (depth) and one based on evolutionary distance (r). Each removal leaves a number, m, of orphan subtrees from which to form T, each orphan subtree having a node of degree 2 specifying where it will be reattached. A special case occurs when the removal covers leaves of the original tree. Each leaf is treated as a separate orphan subtree consisting of a node of degree 0. To produce a final list of trees a fixed radius is taken and all possible branch or node CPs in a tree are considered. If the ith CP produces a set of trees Ti, the final, nonredundant list of trees is produced by T = {cup}iTi. In this case, | T| is the number of trees in the neighborhood.


Figure 2
View larger version (81K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 Pictorial description of SNAP based on a node CP x (a) and a branch CP e (b). In each case, the tree shown on the left may be reduced to a set of subtrees of size m according to an edge-based or evolutionary distance-based radius.

 
Specifying a radius of a particular node depth, d, controls the size of |T| because it defines the maximum number of subtrees produced by the removal. For a given d the maximum number of orphan subtrees, m, is straightforward to calculate. For a branch CP, m = 2d + 1, while for node CP, m = 2x 3d – 1. The number of orphan subtrees, and consequently the size of |T|, is reduced when the removal covers leaf nodes in the original tree. Reconstructing the evolutionary relationship between orphan subtrees is analogous to a standard phylogenetic problem of size m, and the number of possible trees is given by (2m 5)!! (Felsenstein 2004). For example, a branch CP with d = 1 produces m = 4 orphan subtrees that have 3!! = 3 possible tree topologies describing their relationship, a neighborhood of rearrangements identical to that produced by NNI. Note that SPR and TBR have no equivalent under SNAP.

Alternatively, an evolutionary distance-based radius can be used to specify the region to remove from the tree. This does not give full control over |T| because it does not limit the maximum number of subtrees produced by the removal. For any positive value of d it is possible to construct a tree where the orphan nodes correspond exactly to the original sequences by setting all the internal branches to length 0 to produce a large multifurcation (polytomy), which would make the subtree problem equivalent to estimating the whole tree. In reality, the problem of estimating trees from identical sequences is not interesting to an evolutionary biologist, but it demonstrates the real and persistent problem that large polytomies in trees would cause such an approach. An alternative approach is to specify d and progressively remove branches of the tree in the order of their distance from the CP, deciding ties randomly, until a predefined number of branches have been removed. This approach seems viable but is not pursued further in this study.

Computation, implementation, and approximations
In this study, a single variant of the SNAP family of TEHs was implemented for maximum likelihood tree estimation using a node CP and d = 2 (Fig. 2a, left and center-right). This choice is particularly appropriate for comparing the general properties of using SNAP relative to NNI and for simple likelihood computation. It increases the number of orphan subtrees from four under NNI to six under SNAP, increasing the maximum size of |Ti| from 3 to 105 topologies, which is a feasible number to compute by brute force.

To further improve the computational performance a number of approximations to this full likelihood approach were made. These improve performance by enabling multiple refinement steps to occur at once and/or speeding up each individual likelihood computation, which reduces the time spent numerically optimizing the likelihood function. The first and second approximations combine to achieve both of these goals. Instead of constructing all of T and then assessing the likelihood of each tree, it is useful to sequentially consider each Ti and compute partial likelihoods of all of the m orphan subtrees. This allows these subtrees to be considered as the leaves of a tree and enables Ti to be computed as though it consisted of only m species. This approximation is similar in principle to the one used in phyml, where NNI is speeded up by calculating the partial likelihoods of subtrees (see Guindon and Gascuel, 2003, for details).

The second approximation concerns how the best trees in Ti are combined to propose a new tree as a starting point for the next round of iteration. Instead of accepting only the single best tree from T, multiple SNAPs are allowed. This can speed up convergence to an optimal topology. Every tree topology in T can be described in terms of rearrangements to the original tree topology. The rearrangements required to transform the original tree topology to the best tree in Ti are defined as {tau}i and have an associated likelihood of Li, with {tau} = { {tau}1,{tau}2,...,{tau}n – 3} and L = {L1,...,Ln – 3}. A special case occurs when the best tree in Ti is the same as the original tree; in these cases {tau}i contains no rearrangement information and, before any further computation, {tau}i and Li are removed from the sets {tau} and L, respectively. To obtain the tree for the next round of refinement a series of rearrangements from {tau} are iteratively applied to the original tree, continuing until {tau} is empty. Firstly, the rearrangements {tau}i associated with the highest likelihood in L are performed on the tree and these values are removed from {tau} and L. Each application of a {tau}i is the equivalent of performing a SNAP rearrangement on the tree and if only a single round of iteration were performed it is exactly the same as accepting the tree in T with the highest likelihood. Secondly, this active set of rearrangements may overlap other rearrangements that are held in {tau} and consequently these are no longer applicable. To ensure these invalid rearrangements are not performed, all elements of {tau} whose CP node falls within four edges of the active rearrangement, and their associated likelihoods in L, are also removed. These two steps are repeated until {tau} and L are empty.

The final approximation is novel, remarkably simple, and affects only numerical optimization. Finding the ML for up to 105 trees is the most computationally expensive step in SNAP and typically requires large numbers of likelihood calculations. Briefly, the approximation works by quickly generating an approximate rank order of the likelihoods for the 105 trees and then finds the ML topology by performing a full numerical calculation on only a small fraction of the highest ranking: 5 of the 105 possible topologies for the implementation presented in this study. More complete details and justification of this approximation are located in Appendix 1.

The main problem that any of the three approximations above face is the acceptance of a suboptimal tree during refinement. None of the approximations described were found to significantly impact the quality (likelihood) of the final estimated tree (results not shown).

Resampling: Partial Stepwise Addition
The second improvement I introduce to TEHs is a new resampling scheme based on stepwise addition. Initially a number of sequences are plucked from a tree, with the choice of sequences based on their fit to the current tree topology. The topology of the remaining sequences may then be rearranged and the sequences are replaced using the PSA algorithm. The algorithm and methodology for PSA are intimately linked, which is reflected in the following discussion.

Sequence plucking
To perform PSA, a number of sequences need to be removed from the tree. This number, r, is randomly drawn from a binomial distribution according to Pr (X = r) = Formula pr(1 – p)nr, where n is the number of sequences in the tree and the expected number of sequences to be removed, E(r) = np, is controlled by the variable p. A small empirical study indicated that p = 0.33 performed well, so in Leaphy on average one third of sequences is removed from the tree before PSA. Once r is specified, the choice of which sequences to remove needs to be decided. To enable the algorithm to efficiently explore good regions of tree space, those sequences that are least compatible with the current tree topology should be identified and removed for PSA replacement. To score the fit of the sequences to the tree, two nx n pairwise distance matrices are calculated: D, whose elements di,j are the usual pairwise distances between the sequences i and j; and D*, whose elements di,j* are the pairwise distances calculated from the current tree topology, in other words, the sum of branch lengths separating sequences i and j on the tree. These matrices are used to calculate a root mean squared deviation (RMSD) for each sequence i in the tree according to RMSDi = {surd}{sum}j != i(di,j di,j*)2/n, which provides a simple measure of how well each sequence fits the current tree topology. The sequences to be removed are sampled without replacement with probability proportional to their RMSD until r sequences have been removed.

Subtree refinement
After the sequences have been removed from the tree, the topology describing the remaining sequences has a probability of undergoing a round of SNAP refinement. For the implementation discussed here this probability is arbitrarily set to 0.25.

Partial stepwise addition
The PSA approach used in Leaphy is very similar to other stepwise addition approaches, although the computational approximation is only suitable for ML. Starting with an nr species tree, the removed sequences are progressively added in a random order. In standard stepwise addition, the location where a sequence is added is decided by calculating the score of all topologies formed by adding the sequence to each of the branches of the original tree and then choosing the one with the highest likelihood value. Under ML, calculations for reasonably sized trees can be very time consuming because it requires a large number of computationally expensive full likelihood optimizations. PSA is an approximate approach, where the partial likelihood is calculated on each side of the branch where a sequence is to be added, and instead of a full optimization the three branches of the newly formed tree are optimized as though they were a three-species tree (see Fig. 3). The principle behind this is similar to the approximation for calculating likelihoods using subtrees in SNAP (see above). This approach can still be slow when sequences need to be added to large trees but offers a substantial improvement over classic stepwise addition approaches.


Figure 3
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 Pictorial description of the approximation used in partial stepwise addition. A fourth species can be added to any of the branches of the three species tree on the left. The approximation works by calculating a partial likelihood for the central black node and calculating likelihoods as though they were a three-species tree. For example, an n-species tree would require 2n – 3 numerical optimizations on a three-species tree.

 
Stopping Rules
The stopping rule used in this implementation is arbitrary. Based on a tree of n species, this implementation applies two stopping rules. Firstly, the algorithm will terminate after n refinement and resampling steps. Secondly, the algorithm will also terminate if it does not find an improved tree for min (10 + {surd}n,n) consecutive refinement and resampling steps. These choices seem to work well in practice, although it is possible that the number of resamples required by the second rule may be reduced with minimal impact on performance.

Tabu Search
Tabu search has been successfully applied to many difficult numerical optimization problems (Glover et al., 1993), including the traveling salesman problem, and here I introduce it to phylogenetics, albeit in a limited fashion. The underlying principle of tabu search is to prevent the optimization algorithm revisiting regions of parameter space that have already been sampled, usually in the context of a probabilistic search strategy such as simulated annealing. This tabu area can be defined either entirely by the trees already visited or by some of their features, such as particular partitions in trees (branches). For this study only a basic tabu that prevents the TEH visiting previously examined trees is implemented, although the exact details vary slightly for the resampling and refinement scheme.

Tabu for resampling
The resampling scheme will ideally sample widely from good regions of tree space. In practice what often happens is that resampled trees are very similar (if not identical) to previously examined trees, which means that the resampling scheme is not exploring tree space in the most effective manner. To address this issue, instead of only making previously visited trees tabu, each tree and all those within a conceptual radius around them are made tabu. This is achieved by measuring the minimum Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) between the resampled tree and those already visited and rejecting trees that are within a certain distance; for Leaphy this radius is set arbitrarily to 6, which is the equivalent of three NNI operations. Efficient computation of the RF distance is achieved using methodology similar to Day (1985). In cases where there is little uncertainty about the tree, which occurs when there are few sequences or a lot of "clean" data, the resampling scheme consistently revisits the same region of tree space and very few trees are accepted. To stop the resampling scheme getting stuck, it is allowed three attempts to find an acceptable new region of tree space and if it fails a series of random SNAPs are performed, where the accepted trees are chosen randomly. The number of SNAPs performed is sampled from the binomial distribution, with n set to the number of nodes and p = 0.2, and then successively applied randomly to the nodes of the tree.

Tabu for SNAP refinement
The use of an RF distance to define regions of tree space is not so useful for SNAP refinement and instead only trees that have been previously visited are made tabu. This is because some of the trees around a good tree can occasionally lead to an improved tree topology. Consider, for example, the case where a tree is one step away from a previously visited optimum. The T trees considered by the refinement method may contain that optimum and another very good tree that when visited may lead to other good trees. This is in conflict with the previous discussion about tabu in resampling schemes, but forcing the resampling away from the current tree while being less restrictive during refinement is highly effective in practice.

Comparative Performance
Models and likelihood computation
All calculations in all programs for tree and pairwise distance estimation are performed using standard ML approaches under the widely available WAG model (Whelan and Goldman, 2001). All numerical optimizations of the likelihood function required by the new algorithms in Leaphy are performed using a quasi-Newton method (adapted from dfpmin, Press et al., 1992). During computation gaps are considered missing data.

Other programs
For this study the implementation of the new TEHs in Leaphy is compared to six other phylogenetic programs. These are chosen to represent some popular and some particularly effective approaches using a suitable mixture of methodology. These choices are constrained by computational speed but reflect the performance of existing phylogenetic software. Four of these programs base their analyses on pairwise distance estimates (obtained from a fast and accurate in-house program). The first three are clustering algorithms: Neighbor (an implementation of Neighbor-Joining; Felsenstein, 2005), Weighbor (Bruno et al., 2000), and BioNJ (Gascuel, 1997). The final distance-based program is fastME (Desper and Gascuel, 2002), which uses pairwise distance estimates to calculate a minimum evolution score as an objective function to be optimized instead of simple clustering. It initially produces a tree by stepwise addition and then uses the NNI refinement scheme to find an improved tree estimate. The next, Phyml (Guindon and Gascuel, 2003), is intended to be representative of the many programs that use pairwise distance clustering to obtain a starting tree, refine using NNI, and do not use a resampling scheme. Phyml has been demonstrated to be an effective tool for tree estimation and is becoming widely used (e.g., Ciccarelli et al., 2006). The final program is IQPNNI, which, although not currently widely used, is representative of a full TEH, with an NNI refinement scheme and fast and effective resampling. IQPNNI has been shown to outperform many more widely used programs (Vinh and von Haeseler, 2004), which was confirmed by a small preliminary study (results not shown). IQPNNI is run for a minimum of 20 iterations before it applies its dynamic stopping rule and has a probability of 0.5 of deleting a sequence. Several other popular and effective programs are not included in this study, either to ensure directly comparable methodology and/or as a result of computational limitations, including those implementing stochastic searches (e.g., simulated annealing and Bayesian inference; Stamatakis, 2005; Huelsenbeck and Ronquist, 2001) or genetic algorithms (e.g., MetaPiga: Lemmon and Milinkovitch, 2002; Lewis, 1998; Zwickl, 2006). Programs that can only estimate trees on nucleotide sequences were also excluded from this study because the initial version of Leaphy does not include nucleotide models; these include popular programs, such as dnaml, and PAUP*, whereas other programs did not have amino acid models available at the time of analysis, such as RAxML.

To ensure the comparability of results, the initial proposed tree for all score-based methods is taken as that estimated from BioNJ, and source code describing the specific models of amino acid replacement used for each program was examined to ensure that exactly the same model is used. The only exception to this is FastME, which was not amenable to altering the starting tree and was allowed to use its default setting. The following results do not appear to have been unduly affected by any potential advantage this offered.

Criteria for comparison
For many purposes, the performance of a phylogenetic methodology, such as likelihood or parsimony, and the TEH are intertwined, but should be viewed separately when considering method and tool development. A TEH is successful if the best tree estimate found by the algorithm is the globally optimal tree. This global optimum can only be identified by examining impractically large numbers of trees, so calculating it and comparing a program's ability to recover it is impossible for all but the simplest problems. Instead, this study takes the pragmatic approach of identifying the program(s) that estimate the tree with the highest likelihood, which is a measure of the relative performance of programs and TEHs. Phylogenetic methods, on the other hand, may be considered successful if they are statistically consistent, which means for very large amounts of data the globally optimal tree will be the same as the "true" tree, and/or efficient, which means that the methodology rapidly converges to the "true" or "good" tree. This is not the topic of this study and will not be discussed further.

To compare TEHs using likelihood it is necessary to ensure that the likelihoods computed by different programs are comparable. Pairwise distance methods do not explicitly calculate a likelihood, whereas other programs optimize the likelihood to differing degrees of rigor and/or calculate it in subtly different manners, usually varying in their treatment of gaps and/or their treatment of the presence of only a single species in an alignment column. The codeml program of the Paml package (Yang, 1997) is used to obtain a standardized likelihood from a program's optimal tree that is comparable between all programs. This likelihood is the primary method used for comparing the efficacy of the TEH, because measures based on the tree topology may appear different when there are numerous polytomies in a tree. For example, the Robinson-Foulds distance between two trees may be non-zero, but they may have nearly identical likelihoods when internal branches are zero or very close to zero. In this study, trees within a small absolute tolerance of likelihood (10– 4) are considered to have the same topology.

To fully investigate the new algorithm's performance two different versions of the program are examined: LeaphySNAP, which uses only the SNAP refinement step; and Leaphy, which uses both SNAP refinement and PSA resampling. Splitting the analysis in this manner offers an insight into the performance of the constituent parts of the new TEH.

Sequence data
The performance of the new TEH is assessed using the Pandit database, v12.0 (Whelan et al., 2006), which contains large numbers of sequence alignments representative of the types of data evolutionary biologists use to infer trees from. Two restrictions were made on the database that reduced the number of families used: each must contain between 7 and 100 sequences, and the alignment must consist of ≥ 50 columns of aligned amino acids (including gaps). Many phylogenetic programs are very slow and a small number of families that did not complete their tree estimation under an individual program in 7 days CPU time were removed. In all cases where this occurred it was the result of IQPNNI failing to complete. This resulted in 4059 different families used for the comparison of phylogenetic programs. One important feature to consider when examining the following results is that families with long alignments tend to have fewer sequences (see Fig. 4). The tree estimates and their associated likelihoods are available upon request from the author or from http://systematicbiology.org. Pandit is available at the URL: http://www.ebi.ac.uk/goldman-srv/pandit/.


Figure 4
View larger version (92K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 Comparison of the number of sequences alignments contain and their overall length (including gaps). The histograms on the top and right of the graph are the overall densities for the x- and y-axis, respectively. Note that alignments with many sequences tend to have shorter alignments.

 

    Results
 Top
 Abstract
 Tree Estimation Heuristics
 Algorithms and Methods
 Results
 Conclusions
 Appendix 1
 References
 
The new algorithms were implemented in the program Leaphy and their performance assessed relative to other widely used and fast phylogenetic programs. When examining this performance two statistics were found particularly useful: the proportion of the time a program recovers the best tree topology that was discovered by any programs under consideration, and the fraction of the time a program recovers a unique best topology that no other program recovers. The former statistic is a measure of the overall performance of a method and the latter is a measure of its value compared to the other programs. Note that both statistics are calculated based on the likelihood of the tree estimate, assessed independently using codeml, and not the topology—assessment is based on the best likelihood found, not any assumed knowledge of the "true" tree. For a program to have qualified to have found the best tree, its ML estimate must be within 10 4 log likelihood units of the maximum found across all programs. Similarly, for a program to have found a unique best tree, its estimate must be 10– 4 likelihood units higher than any other trees found. When discussing the performance of the new algorithms, it is useful to partition the analysis of results into two sections: the performance of the SNAP refinement (LeaphySNAP) and the performance of the combined algorithms (Leaphy).

Performance of SNAP Refinement—LeaphySNAP
The performance of LeaphySNAP, which uses only a single round of SNAP refinement, demonstrates how useful the new SNAP algorithm is for exploring tree space. Initially, it serves to demonstrate its effectiveness against programs that do not have a resampling step; in other words, all of the standard programs except IQPNNI. In 4015 of the 4059 families (98.9%) examined, LeaphySNAP found the best tree, and in 2709 families (66.7%) it found a unique best tree topology. Figure 5 demonstrates the performance of all programs broken down by the size of the family, with Figure 5a showing the frequency that the best tree is recovered and Figure 5b showing how frequently different programs recover a unique best tree. For many programs the frequency that a best tree is discovered starts relatively high, around 0.5, and steadily decreases as the number of sequences increases. These cases where all software find the best tree tend to be the easiest phylogenetic tree estimation problems, where there are long alignments with few sequences, and all methods are capable of recovering the best tree.


Figure 5
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5 Comparison of LeaphySNAP with other programs that do not have a resampling step in their tree-estimation heuristic. The proportion of families where a program recovers the best tree and the proportion of families where a program recovers a best tree that no other program recovers are shown in a and b, respectively. The symbols describe the programs: {triangleup} NJ; {blacktriangleup} Weighbor; {square} bioNJ; {blacksquare} fastME; x phyml; and + LeaphySNAP. In b, the dotted line describes how frequently any individual program recovers a unique best tree. Symbols for some programs overlap each other near 0.

 
Table 1 contains the total likelihood recorded over all of the families considered in the analysis, providing a measure of how well the trees estimated by programs fit the observed sequence data. Larger differences in likelihood may be associated with more different tree topologies and, as a consequence, potentially quite different patterns of amino acid replacement. The total likelihood obtained under LeaphySNAP is substantially higher than those obtained with other methods. For example, the large improvement of LeaphySNAP over its nearest rival, phyml, is 33, 211.7 likelihood units, an average increase of 8.2 likelihood units per family. In the 3323 families for which LeaphySNAP finds a better tree than phyml, the likelihood improvement per families increases to 10.0. These large differences in likelihoods may indicate that tree estimates may be substantially different, which may affect the results of other inferences, such as the detection of selection or conserved sequences in genomes.


View this table:
[in this window]
[in a new window]

 
Table 1. Performance of tree-estimation programs.

 
The combined evidence presented in these results clearly demonstrates that the use of a single round of SNAP refinement outperforms other comparable TEHs. All programs ran quickly, with a maximum runtime of approximately 2 hours, with LeaphySNAP generally being the slowest of the programs used.

The relative performance of LeaphySNAP declines when it is compared to IQPNNI, which resamples from tree space. Figure 6a and Figure 6b show how frequently different programs discover the best tree and unique best trees, respectively. The use of the important quartet puzzling resampling strategy is clearly beneficial and, when combined with NNI, it is more effective than a single round of other refinement schemes in a substantial number of cases. This is particularly noticeable for larger families, where the tree estimation problem is more difficult and the original tree estimate may be further from the originally proposed tree. The fact that LeaphySNAP still performs quite well in many cases, finding the best tree in 3201 families (78.9%), and finding the unique best tree in 526 families (13.0%), demonstrates the impact that a good refinement scheme can have on tree inference.


Figure 6
View larger version (32K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 6 Comparison of LeaphySNAP with other programs. The proportion of families where a program recovers the best tree and the proportion of families where a program recovers a best tree that no other program recovers are shown in a and b, respectively. The symbols used are {triangleup} NJ; {blacktriangleup} Weighbor; {square} bioNJ; {blacksquare} fastME; x phyml; + LeaphySNAP; ^ IQPNNI. In b, the dotted line describes how frequently any individual program recovers a unique best tree. Symbols for some programs overlap each other near 0.

 
The overall likelihoods in Table 1 for IQPNNI and LeaphySNAP are both substantially higher than other programs, which is indicative of the weaknesses of the NNI refinement scheme where the small neighborhood of trees produces a very ragged landscape with many local optima. IQPNNI was, in general, slower than LeaphySNAP, with a small number of runs terminated and the families discarded from the analysis when over 7 days of CPU usage was recorded.

Performance of Full Algorithm—Leaphy
The inclusion of a resampling strategy and tabu searching in Leaphy substantially improves the quality of the tree estimates. Table 1 shows an increase in log likelihood of 5587.8 obtained by introducing the resampling strategy to the refinement scheme. This translates into Leaphy providing a better tree estimate than LeaphySNAP in 1168 families (28.8%). The full algorithm also compares well to all of the other phylogenetic programs examined. Overall, Leaphy finds the best tree in 4004 families (98.6%) and a unique best tree in 1065 families (26.2%). This contrasts well with its nearest rival, IQPNNI, which finds the best tree in 2957 families (72.9%) but finds a unique best tree in only 55 families (1.4%). Figure 7a and Figure 7b show this performance broken down by number of sequences in a family, showing the frequency that programs estimate the best and unique best trees, respectively. These graphs show that as the number of sequences increases and the tree estimation problem gets harder, the new algorithms in Leaphy become increasingly more effective at finding better tree topologies. For example, for the 151 families containing between 40 and 50 sequences, IQPNNI and Leaphy estimate the best (unique best) tree topology in 36.4% (4.0%) and 95.5% (63.6%) of families, respectively. More generally, the mean number of sequences in the 1157 families (28.5%) when IQPNNI and Leaphy differ is 33.9, compared to a mean number of sequences of 20.2 when considering all families, demonstrating that Leaphy's performance is markedly better in larger families. It is also interesting to note how frequently phylogenetic programs agree. In 49 of the 151 families containing 40 to 50 sequences (32.5%), IQPNNI and Leaphy estimate the same best tree, which may be indicative that estimated tree is particularly good or that a single local optimum is very prevalent, attracting the refinement scheme towards it from large portions of tree space. The frequency with which programs agree decreases as the problem becomes harder, with agreement between IQPNNI and Leaphy only occurring for 6 out of 71 of families (8.5%) containing between 80 and 100 sequences.


Figure 7
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 7 Comparison of Leaphy with all other programs. The proportion of families where a program recovers the best tree and the proportion of families where a program recovers a best tree that no other program recovers are shown in a and b, respectively. The symbols used are {triangleup} NJ; {blacktriangleup} Weighbor; {square} bioNJ; {blacksquare} fastME; x phyml; + Leaphy; ^ IQPNNI. In b, the dotted line describes how frequently any individual program recovers a unique best tree. Symbols for some programs overlap each other near 0.

 
The relative performance of programs can also be examined by calculating the proportion of branches shared between the tree estimated by a program and the best tree estimated by any program (Robinson and Foulds, 1981). The results of this analysis, shown in Figure 8, are comparable to those obtained by examining the likelihoods (Fig. 7). On average, Leaphy has a very high proportion of inferred branches that match the best estimated tree. IQPNNI does not match Leaphy's performance but does noticeably better than the other programs. Examining the total number of branches matching the highest scoring tree is also informative, because these represent the hypotheses that are often tested in phylogenetic trees. For the 36 families with between 90 and 100 sequences, Leaphy recovered 3033/3265 (92.8%) of branches from the best tree, whereas IQPNNI recovered 2638/3265 (80.8%) of these branches. Furthermore, it is notable that all families with a difference in likelihoods above the 10– 4 threshold were found to have different trees, demonstrating the effectiveness using a likelihood (computed independently using codeml from the PAML package) to decide whether trees are different.


Figure 8
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 8 The proportion of branches shared between the optimal tree estimated by individual programs and the best tree estimated by any program. The symbols used are {triangleup} NJ; {blacktriangleup} Weighbor; {square} bioNJ; {blacksquare} fastME; x phyml; + Leaphy; ^ IQPNNI.

 
This improvement is also apparent in the likelihoods presented in Table 1. Leaphy offers an improvement of 3028.8 over its nearest rival, IQPNNI. In the 1157 families where the two programs differ, Leaphy provides a better estimate in 1102 families with a mean improvement in log likelihood of 2.8. In contrast, IQPNNI provides a better estimate in only 55 families with a mean improvement in log likelihood of 1.6, which constitutes the 86.5 log likelihood improvement over Leaphy that can be obtained by choosing the best tree from all phylogenetic programs (see Table 1). The approximations used in the SNAP and PSA algorithms ensure the tree estimation problem remains computationally tractable. The average times taken by Leaphy to estimate the tree for each family are shown in Figure 9, broken down by family size. As expected, larger families require more time to estimate trees, with the fit of the trend line suggesting something similar to quadratic complexity. The size of the interquartile range also increases as the number of sequences in a family grows. The slowest family, PF00076, which contains an alignment of 77 amino acids covering 90 sequences, took 56 hours to complete and was 21 hours slower than any other family. This family had 698 trees stored for use in tabu searching, which is the highest number for any family and is substantially higher than the mean number of 90.2 (interquartile range of 57 to 109). The reason why this family took so long is unclear, but the likelihood of –13,476.5 for the tree estimated by Leaphy was substantially better than the likelihood of –13,517.5 obtained by IQPNNI, the best performing of the remaining programs, demonstrating that the slow computational time led to an improved tree estimate. One should, however, be cautious in generalizing these timings because it is unclear how the negative correlation between sequence length and number of sequences in a family (Fig. 4) affects computational time.


Figure 9
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 9 Points show the median amount of time taken, in minutes, for Leaphy to compute its final tree estimate; vertical bars indicate the interquartile range of time estimates. The dotted line is the best fitting quadratic that goes through the origin (y = 0.031x2.048; R2 = .995).

 
These results strongly support the conclusion that the new algorithms presented here provide significant progress over other comparable fast tree estimation heuristics, although in some particularly difficult cases it may be advisable to use multiple phylogenetic programs for tree inference.


    Conclusions
 Top
 Abstract
 Tree Estimation Heuristics
 Algorithms and Methods
 Results
 Conclusions
 Appendix 1
 References
 
This study introduces a new set of tree estimation heuristics (TEHs) and demonstrates the effectiveness of an implementation, Leaphy, at estimating phylogenetic trees from protein sequence alignments. The results clearly show that the likelihood of the trees estimated by Leaphy are equal to or higher than those of the other programs considered in the overwhelming majority of cases (98.6% of families). In this type of study, it is only possible to study a small fraction of the available phylogenetic programs. I note that many highly successful programs are not included, usually because they are too slow, do not analyze amino acid sequences, or do not include the WAG model. Similarly, this study does not use simulated data to assess the efficacy of a TEH to recover the "true" tree. Examining simulated data in this manner confounds the problem of assessing the effectiveness of a TEH with the properties of the method. It also places undue importance on the proposal mechanism, which in simulated data tends to perform better because the model is accurate and there are none of the complexities that are endemic in real data, such as changes in composition and selective pressures. In general, I argue that the results presented here are indicative of good future performance of the new TEHs, but this can only be truly assessed by its use in large numbers of studies undertaken by independent researchers, as typically happens during the development of phylogenetic tree estimation programs.

The utility of the PSA and the SNAP families of algorithms should not only be judged through the comparison of the single implementation used in Leaphy with other programs. The specific versions of the algorithms used in Leaphy have some underlying problems that would ideally be addressed in future implementations. For example, the refinement scheme is quite limited, especially for larger trees. Ideally, the use of larger radius SNAPs could be combined with computational approximations based on, for example, pairwise distances between internal nodes, and more exact branch-and-bound algorithms. This could reduce the number of local optima and be expected to further improve the quality of the tree estimate. The resampling scheme used here is also basic: it is computationally slow for larger trees, even with the approximations used in PSA, and has not been proved to have some desirable properties. For example, it would be beneficial to sample regions of tree space in rough proportion to their likelihood and to ensure that all points of tree space are sampled with a non-zero probability. The current implementation of PSA is unlikely to meet either of these criteria but does represent an effective starting point from which to build future resampling schemes. It is also worth noting that the refinement and resampling strategy may be useful for other methods of phylogenetic inference, such as maximum parsimony and Bayesian inference. It is not difficult to imagine a Markov chain Monte Carlo sampling approach based on any radius of SNAP rearrangements.

The programs used in this study are available on request from the author or via the website http://www.bioinf.manchester.ac.uk/leaphy for Linux and Windows.


    Appendix 1
 Top
 Abstract
 Tree Estimation Heuristics
 Algorithms and Methods
 Results
 Conclusions
 Appendix 1
 References
 
Point Calculation as an Approximation to Full Optimization
A major computational bottleneck when performing a SNAP refinement is the full numerical optimization of the likelihood of the topologies in its neighborhood, Ti, consisting of up to 105 tree topologies. To reduce this computational burden, I introduce an approximation where tree topologies are rapidly assessed by calculating a single likelihood value and a small number of the best are then fully optimized. Each tree in Ti consists of external branches leading to the subtrees formed during SNAP and internal branches whose presence define the new tree topology. The lengths of the external branches are taken to be those of the original tree topology and the lengths of the internal branches are set to an arbitrary small number. These values are used to calculate a single likelihood for each tree in Ti, after which an arbitrary number are fully optimized before performing the SNAP step.

Using a single likelihood calculation with small internal branches to compare topologies may be thought of as an approximation to calculating the first derivative of the likelihood function with respect to branch lengths at a star topology with a set of short branches defined by the topology. This would be indicative of how much evidence there is for the existence of a particular set of internal branches in a tree. Preliminary studies tested small branch lengths ranging from 0.01 to 1.0 and indicated that 0.l produces likelihoods whose rank order correlates well with that of the fully optimized likelihoods (results not shown). The efficacy of the approximation is tested by examining the best tree estimated using LeaphySNAP (see Results) with and without this approximation. A majority of families, 3137 (77.2%), show no change in the final tree estimate. When there is a change in tree topology, it is as likely to increase the likelihood as decrease it: 431 (10.6%) families show a decrease in likelihood (mean 2.1), whereas 491 (12.1%) families show an increase in likelihood (mean 2.1).


    Acknowledgements
 
I would like to thank Nick Goldman and Tim Massingham for insightful comments during the development of the new algorithms, Des Higgins for introducing me to the idea of tabu search, and Alexandros Stamatakis and an anonymous reviewer for their useful and constructive comments. The majority of the tree estimations were performed on a 200-CPU cluster, kindly provided by IBM to the Research Program of the European Bioinformatics Institute.


    References
 Top
 Abstract
 Tree Estimation Heuristics
 Algorithms and Methods
 Results
 Conclusions
 Appendix 1
 References
 

    Bruno W. J., Socci N. D., Halpern A. L. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. (2000) 17:189–197.[Abstract/Free Full Text]

    Charleston M. A. Hitch-hiking: A parallel heuristic search strategy, applied to the phylogeny problem. J. Comput. Biol. (2001) 8:79–91.[CrossRef][Web of Science][Medline]

    Ciccarelli F. D., Doerks T., von Mering C., Creevey C. J., Snel B., Bork P. Towards automatic reconstruction of a highly resolved tree of life. Science (2006) 311:1283–1287.[Abstract/Free Full Text]

    Day W. H. E. Optimal algorithms for comparing trees with labeled leaves. J. Class. (1985) 2:7–28.[CrossRef]

    Desper R., Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comp. Biol. (2002) 19:687–705.

    Felsenstein J. Inferring phylogenies (2004) Sunderland, Massachusetts: Sinauer Associates.

    Felsenstein J. PHYLIP (phylogeny inference package). Version 3.6 (2005) Seattle: Department of Genome Sciences, University of Washington. Distributed by the author.

    Ganapathy G., Ramachandran V., Warnow T. On contract-and-refine transformations between phylogenetic trees. Ian Munro J., ed. (2004) Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2004: New Orleans, LouisianaUSA. SIAM. 893–902.

    Gascuel O. BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. (1997) 14:685–695.[Abstract]

    Glover F., Taillard E., de Werra D. A user's guide to tabu search. Ann. Operations Res. (1993) 41:3–28.

    Goldman N., Anderson J. P., Rodrigo A. G. Likelihood-based tests of topologies in phylogenetics. Syst. Biol. (2000) 49:652–670.[Abstract/Free Full Text]

    Guindon S., Gascuel O. A simple, fast and accurate method to estimate large phylogenies by maximum-likelihood. Syst. Biol. (2003) 52:696–704.[Abstract/Free Full Text]

    Hahn B. H., Shaw G. M., de Cock K. M., Sharp P. M. AIDS as a zoonosis: Scientific and public health implications. Science (2000) 287:607–614.[Abstract/Free Full Text]

    Huelsenbeck J. P., Ronquist F. MrBayes: Bayesian inference of phylogeny. Bioinformatics (2001) 17:754–755.[Abstract/Free Full Text]

    Huson D. H., Nettles S. M., Warnow T. J. Disk-covering, a fast converging method for phylogenetic tree reconstruction. J. Comp. Biol. (1999) 6:369–386.[CrossRef]

    Kosiol C., Bofkin L., Whelan S. Phylogenetics by likelihood: Evolutionary modeling as a tool for understanding the genome. J. Biomed. Informatics (2006) 39:51–61.[CrossRef][Web of Science][Medline]

    Lemmon A. R., Milinkovitch M. C. The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimation. Proc. Natl. Acad. Sci. USA (2002) 99:10516–10521.[Abstract/Free Full Text]

    Lewis P. O. A genetic algorithm for maximum likelihood phylogeny inference using nucleotide sequence data. Mol. Biol. Evol. (1998) 15:277–283.[Abstract]

    Lundy M. Applications of the annealing algorithm to combinatorial problems in statistics. Biometrika (1985) 72:191–198.[Abstract/Free Full Text]

    Nielsen R., ed. Statistical methods in molecular evolution (2005) New York: Springer.

    Nixon K. The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics (1999) 15:407–414.[CrossRef][Web of Science]

    Press H. P., Teukolsky S. A., Vetterling W. T., Flannery B. P. Numerical recipes in C (1992) 2nd edition. Cambridge, UK: Cambridge University Press.

    Robinson D. R., Foulds L. R. Comparison of phylogenetic trees. Math. Biosci. (1981) 53:131–147.[CrossRef][Web of Science]

    Rogers J. S. On the consistency of maximum likelihood estimation of phylogenetic trees from nucleotide sequences. Syst. Biol. (1997) 46:354–357.[Free Full Text]

    Saitou N., Nei M. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. (1987) 4:406–425.[Abstract]

    Salter L., Pearl D. K. Stochastic search strategy for estimation of maximum likelihood phylogenetic trees. Syst. Biol. (2001) 50:7–17.[Abstract/Free Full Text]

    Sankoff D., Abel Y., Hein J. A tree, a window, a hill; generalization of nearest neighbor interchange in phylogenetic optimization. J. Class. (1994) 11:209–232.[CrossRef]

    Siepel A., Bejerano G., Pedersen J. S., Hinrichs A., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L. W., Richards S., Weinstock G. M., Wilson R. K., Gibbs R. A., Kent W. J., Miller W., Haussler D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. (2005) 15:1034–1050.[Abstract/Free Full Text]

    Stamatakis A. An efficient program for phylogenetic inference using simulated annealing. (2005) Proceedings of 19th IEEE/ACM International Parallel and Distributed Processing Symposium, High Performance Computational Biology Workshop: DenverColorado.

    Stamatakis A., Ludwig T., Meier H. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics (2005) 21:456–463.[Abstract/Free Full Text]

    Swofford D. L., Olsen G. J., Waddell P. J., Hillis D. M. Phylogenetic inference. In: Molecular systematics—Hillis D. M., Moritz C., Mable B. K., eds. (1996) 2nd edition. Sunderland, Massachusetts: Sinauer Associates. 407–514.

    Thornton J. M., Orengo C. Protein families and their evolution—A structural perspective. Annu. Rev. Biochem. (2005) 74:867–900.[CrossRef][Web of Science][Medline]

    Vinh L. S., von Haeseler A. IQPNNI: Moving fast through tree space and stopping in time. Mol. Biol. Evol. (2004) 21:1565–1571.[Abstract/Free Full Text]

    Vos R. A. Accelerated likelihood surface exploration: The likelihood ratchet. Syst. Biol. (2003) 52:368–373.[Abstract/Free Full Text]

    Whelan S., de Bakker P. I. W., Quevillon E., Rodriguez N., Goldman N. PANDIT: An evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res. (2006) 34:D327–D331.[Abstract/Free Full Text]

    Whelan S., Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol. Biol. Evol. (2001) 18:691–699.[Abstract/Free Full Text]

    Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS (1997) 13:555–556.[Medline]

    Yang Z., Goldman N., Friday A. Maximum likelihood trees from DNA sequences: A peculiar statistical estimation problem. Syst. Biol. (1995) 44:384–39.[Abstract]

    Zwickl D. J. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion (2006) The University of Texas at Austin. Ph.D. dissertation.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
ScienceHome page
A. Loytynoja and N. Goldman
Uniting Alignments and Trees
Science, June 19, 2009; 324(5934): 1528 - 1529.
[Abstract] [Full Text] [PDF]


Home page
Phil Trans R Soc BHome page
S. Whelan
The genetic code can cause systematic bias in simple phylogenetic models
Phil Trans R Soc B, December 27, 2008; 363(1512): 4003 - 4011.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
A. Stamatakis, P. Hoover, and J. Rougemont
A Rapid Bootstrap Algorithm for the RAxML Web Servers
Syst Biol, October 1, 2008; 57(5): 758 - 771.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
J. P. Huelsenbeck, C. Ane, B. Larget, and F. Ronquist
A Bayesian Perspective on a Non-parsimonious Parsimony Model
Syst Biol, June 1, 2008; 57(3): 406 - 419.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Whelan, S.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Whelan, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?