Skip Navigation

Systematic Biology 2006 55(5):756-768; doi:10.1080/10635150600975218
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (13)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Boussau, B.
Right arrow Articles by Gouy, M.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Boussau, B.
Right arrow Articles by Gouy, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2006 Society of Systematic Biologists

Efficient Likelihood Computations with Nonreversible Models of Evolution

Edited by Olivier Gascuel: Associate Editor

Bastien Boussau and Manolo Gouy

Laboratoire de Biométrie et Biologie Evolutive (UMR 5558); CNRS, Université Lyon 1 43 boulevard 11 nov 1918, 69622, Villeurbanne Cedex, France E-mail: boussau{at}biomserv.univ-lyon1.fr (B.B.)


    Abstract
 Top
 Abstract
 Computing the Likelihood of...
 nhPhyML, Adaptation of PhyML...
 Estimation of the Universal...
 Conclusion
 References
 
Recent advances in heuristics have made maximum likelihood phylogenetic tree estimation tractable for hundreds of sequences. Noticeably, these algorithms are currently limited to reversible models of evolution, in which Felsenstein's pulley principle applies. In this paper we show that by reorganizing the way likelihood is computed, one can efficiently compute the likelihood of a tree from any of its nodes with a nonreversible model of DNA sequence evolution, and hence benefit from cutting-edge heuristics. This computational trick can be used with reversible models of evolution without any extra cost. We then introduce nhPhyML, the adaptation of the nonhomogeneous nonstationary model of Galtier and Gouy (1998; Mol. Biol. Evol. 15:871–879) to the structure of PhyML, as well as an approximation of the model in which the set of equilibrium frequencies is limited. This new version shows good results both in terms of exploration of the space of tree topologies and ancestral G+C content estimation. We eventually apply it to rRNA sequences slowly evolving sites and conclude that the model and a wider taxonomic sampling still do not plead for a hyperthermophilic last universal common ancestor.

Keywords: Efficient algorithm; LUCA; maximum likelihood; molecular phylogeny; nonreversible model of evolution; PhyML; origin of life; root of life

Received October 17, 2006; Revised December 10, 2005; Accepted March 28, 2006


Research in molecular phylogeny aims at reconstructing historical relations between genes or species while trying to capture the true nature of the evolutionary process itself. Both can be estimated at the same time through the use of statistical modeling. Maximum likelihood or the Bayesian framework permit the estimation of parameters of the evolutionary model such as the transition/transversion ratio, the equilibrium base composition, and the tree itself, topology and branch lengths included. Optimizing all these parameters is computationally intensive: the number of possible topologies increases factorially with the number of taxa considered, which makes it necessary to use heuristics when exploring the space of tree topologies. Most recent algorithms (e.g., PhyML [Guindon and Gascuel, 2003], RAxML [Stamatakis et al., 2005]) are able to find trees with excellent likelihood scores for hundreds of sequences, but only with reversible models of evolution. All these reversible models are homogeneous and stationary, i.e., suppose that state evolution is constant all over the tree. If this hypothesis were true, sequences sharing a common ancestor would have the same expected base frequencies.

More precisely, a process of evolution is homogeneous when the state distribution probability simply depends on the time separating it from a given past state distribution probability and not on the branch in the tree: homogeneity is the feature of a process of evolution that is constant in pattern over the whole tree. On the other hand, stationarity is the feature of a process of evolution that keeps the state distribution probability constant over the whole tree: the probability to draw a given state is the same wherever on the tree the sampling is done. A process can be stationary and not be homogeneous, as is the case for Ziheng Yang's codon model in which the nonsynonymous-to-synonymous ratio varies across branches while codon equilibrium frequencies remain constant all over the tree (Yang, 1998). On the contrary, nonstationarity induces nonhomogeneity, as the process of evolution depends upon the equilibrium frequencies.

The analysis of extant sequences shows that homologous genes vary widely in their composition. As they all stem from a common ancestor, this evidences that sequence evolution is at least not stationary: two sequences in two different species or at two different periods evolve towards different compositions.

The use of nonhomogeneous and nonstationary models that account for this variability in evolution permits minimizing compositional biases and hence improving phylogenetic reconstructions (Galtier and Gouy, 1998; Tarrio et al., 2001; Herbeck et al., 2005). Unfortunately, removing the homogeneity and stationarity hypotheses implies abandoning reversibility, and hence prevents one from using the most efficient algorithms, when those particularly variable-rich models would most eagerly need it.

In this article we show in the general case that it is possible to use recent algorithms with nonreversible models of sequence evolution. We first explain how the likelihood of a tree is computed and how the reversibility property is used in recent heuristics to avoid dispensable calculations during tree space search. We then prove that the same computational trick can be used with nonreversible models of evolution through a reorganization of the way likelihood is computed. As reversible models of evolution can also be used in this framework, this work can be considered as a generalization of the usual formulas, which considerably broadens the amount of models that can be used for a phylogenetic analysis, as all nonhomogeneous as well as nonstationary models can be used.

Eventually we report nhPhyML, the first adaptation of PhyML Guindon, 2003, a very fast and efficient algorithm, to a nonreversible model of evolution; namely, the implementation of Tamura's model at each branch of a tree introduced by Galtier and Gouy (Tamura, 1992; Galtier and Gouy, 1998). We also introduce nhPhyML-Discrete, an approximation of nhPhyML. This implementation shows better performance than nhPhyML in the exploration of the space of tree topologies and similar accuracy in the estimation of the ancestral G+C content. Eventually, we apply nhPhyML to ribosomal RNA slowly evolving sites and conclude that a wider taxonomic sampling than in Galtier et al. (1999) still does not support a hyperthermophilic last universal common ancestor.


    Computing the Likelihood of a Tree from Any Node under a Nonreversible Model of DNA Sequence Evolution
 Top
 Abstract
 Computing the Likelihood of...
 nhPhyML, Adaptation of PhyML...
 Estimation of the Universal...
 Conclusion
 References
 
Computing the Likelihood of a Tree
We first explain how one computes the likelihood of a phylogenetic tree with DNA sequences using the following example (Fig. 1).


Figure 1
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 Example rooted tree for likelihood computation. This tree is composed of a root R, an internal node U, three other nodes or leaves A, B, and C, and four branches of length lA, lB, lC, lU, and other evolutionary parameters vA, vB, vC, vU. We are here interested in the likelihood of the tree for a single site. The internal node states are unknown and then represented as variables x at node R, y at node U, z at node A, q at node B, and v at node C. Arrows represent the evolutionary direction, from the root of the tree to its leaves.

 
Most commonly, sites are supposed to evolve independently of each other: a site does not depend on its neighbors' states but only on its past state. As a consequence, the likelihood of a tree for a whole sequence is obtained by multiplying all the likelihoods obtained at single sites.

The likelihood Ls of the tree given in Figure 1 for a single site s is computed as follows:


Formula 1

(1)
where Pxy(lA,vA) is the probability for base x to change into base y along a branch of length lA and other evolutionary parameters vA, P(R = x) is the probability to have base x at the root R, and {Omega} = {A,T,C,G} is the set of possible DNA bases. Ls,low(RA)(A = z) is the lower conditional likelihood (Felsenstein, 1981) of observing the data downstream from branch RA conditionally on the underlying subtree and on having base z at node A. For each subtree, one can define four conditional likelihoods, one for each DNA base. Once these conditional likelihoods have been computed for a subtree, as long as its topology and branch lengths do not change, they can be re-used if one moves the whole subtree around the topology. This property is used in recent heuristics to search for the most likely phylogenetic tree. These conditional likelihoods are defined as lower, in the sense that they do not contain the root.

Lower conditional likelihoods are defined recursively. For a leaf C:


Formula 2

(2)

And for a subtree whose root is in U:


Formula 3

(3)

Computing the Likelihood When the Model of Evolution Is Reversible
Reversibility
When computing the likelihood of a phylogenetic tree, a root R must be specified. If the model is homogeneous and reversible, the process of evolution is stationary: wherever the root is, its base proportions are the same, i.e., they are the equilibrium frequencies of the process, noted {pi}: P(R = x) = {pi}x. (1) can be rewritten:


Formula 4

(4)

Reversibility means that, averaged over the whole sequence, the flux from one base to another is equal to the flux from this other base back to the first one:


Formula

Supposing the model used is reversible, as is the case with most current models of DNA sequence evolution, we can rewrite the likelihood in (4) as:


Formula 5

(5)

Expression (5) can be read as if the root was placed at node U. The root can therefore be placed at any node, on any branch of the tree, a property named "pulley principle" by Felsenstein (1981), and widely used in heuristics to find most likely trees. Considering Figure 1, this makes arrows meaningless.

The possibility to place the root of the tree wherever is needed is thoroughly used in recent heuristics to the problem of the most likely phylogenetic tree. Those heuristics usually explore the space of tree topologies by applying local rearrangements: "nearest neighbor interchange" (NNI) swaps two subtrees around an internal edge (used in PhyML [Guindon and Gascuel, 2003]), "subtree pruning and re-grafting" removes a subtree from the whole tree and places it on another edge (used in RAxML [Stamatakis et al., 2005]), and "tree bisection and reconnection" splits the tree into two subtrees that are rewired by any of their edges. In all these rearrangements, whole subtrees remain fixed: their branches still have the same parameters and their internal topology is unchanged. By defining conditional likelihoods for fixed subtrees, and by placing the root at the rearrangement point, one can avoid much computation when exploring the space of tree topologies. As the root is placed at the rearrangement point, all the conditional likelihoods can be considered as lower from a mathematical point of view since none contains the root.

The most efficient algorithms first compute conditional likelihoods for all subtrees, before they compute an approximate likelihood for topologies obtained with a given sort of rearrangement, using the previously obtained conditional likelihoods. They apply the most promising rearrangements, either all at once (PhyML) or as soon as it is tried (RAxML), optimize evolutionary parameters of the new tree, and eventually start a new round of conditional likelihood calculation and exploration of the space of tree topologies, until convergence.

Computing the Likelihood of a Tree with a Nonreversible Model
Upper conditional likelihoods
In the nonreversible case, upper conditional likelihoods can be defined to account for the true root of the tree and the evolutionary directions of the branches.

We define the upper conditional likelihood at branch RU in the nonreversible case as:


Formula 6

(6)

The underlying branches' upper likelihoods can also be defined recursively:


Formula 7

(7)

The main difference lies in the incorporation of the root nucleotide frequencies in the definition of the upper conditional likelihoods. This way, the root is not moved around the topology, and the evolutionary direction is conserved.

We now prove that the expression of the tree likelihood is not changed when computed from other nodes of the tree using upper and lower likelihoods.

Recurrence
We show that for any branch, say UB,


Formula

We initialize the recurrence with branch RU:


Formula 8

(8)

We expand it:


Formula

With (1):


Formula

The likelihood computed on the branch RU is the same as the one computed at the root. This is also true for the edge RA.

We now suppose we know the likelihood at a branch RU and are interested in the likelihood at an underlying branch UB.


Formula 9

(9)

We expand it, using (7):


Formula

And then rearrange it:


Formula

So that we have proved, with (8):


Formula

By recurrence, we have shown that the likelihood value can be computed from any branch of the tree, which, as will be seen, is particularly useful when exploring the space of tree topologies.

Exploring the Space of Tree Topologies with a Nonreversible Model
Efficient heuristics explore the space of tree topologies by local rearrangements such as nearest neighbor interchanges (NNIs). In the nonreversible case, evolution proceeds from the root of the tree to its leaves, so one must keep this evolutionary direction unchanged. For this purpose, a distinction is made between the branch on which the root is placed and the others. Figure 2a shows that being able to compute the likelihood from branch UB permits to define four subtrees whose conditional likelihoods can be used as constants to estimate the likelihoods of the three alternate topologies. In the nonreversible case, one uses upper conditional likelihood for the root-containing subtree and lower likelihoods for all other subtrees. With no loss of generality, NNIs around branch UB only require exchanges of subtrees having lower conditional likelihoods.


Figure 2
View larger version (73K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 Use of conditional likelihoods when applying NNIs (Nearest Neighbor Interchanges) to an internal branch. (a) NNIs are applied to internal branch UB. The root node is situated in the subtree noted "UPP." The three other subtrees, named D, C, and E, are all lower. By swapping lower subtrees, three different topologies noted 1, 2, 3 are obtained. One can use the conditional likelihoods (upper in one case, lower in the three other cases) of the four subtrees to speed up the likelihood computation for these three alternate topologies. (b) The root node is now situated on the branch around which the NNI is done. All the conditional likelihoods are therefore lower, so any swap can be done.

 
The likelihood of topology 1 Figure 2a can be computed with:


Formula

The likelihoods of topologies 2 and 3 are computed similarly.

In case the internal branch around which NNIs are to be done possesses the root, the situation is slightly different (Fig. 2b). As in the above case, three configurations can be reached through interchanges between subtrees, but here all conditional likelihoods are lower.

The likelihood of the topology 1 (Figure 2b) can be computed as follows:


Formula

The likelihoods of topologies 2 and 3 are computed similarly.

It can be shown that by only doing NNIs, the root can be moved throughout the whole tree, to the exception of leaves: as NNIs keep internal branches internal (and external branches external), the root cannot be moved to a leaf.

Thus, the exploration through NNIs of the space of rooted tree topologies with a nonreversible model is as exhaustive as the exploration of the space of unrooted trees with a reversible model, except that the root cannot go to an external branch.

Once again, and in the same way as computing the likelihood of a tree with a nonreversible model of evolution can be considered as a generalization of the reversible case, the way the space of tree topologies can be explored by NNIs with a nonreversible model of evolution can be seen as a generalization of the reversible case. Overall, nonreversible models of evolution can easily fit into recent heuristics such as PhyML to search for most likely rooted phylogenies.

In the next part, we report a new program built on the algorithmic architecture of PhyML, which explores the space of tree topologies under the nonhomogeneous, nonstationary model of Galtier and Gouy (1998).


    nhPhyML, Adaptation of PhyML Algorithmic Structure to Galtier and Gouy's Model
 Top
 Abstract
 Computing the Likelihood of...
 nhPhyML, Adaptation of PhyML...
 Estimation of the Universal...
 Conclusion
 References
 
We adapted the fast heuristics of PhyML (Guindon and Gascuel, 2003) to Galtier and Gouy's nonhomogeneous and nonstationary model, and we report here results concerning the ability of the resulting nhPhyML program to explore the space of tree topologies and to estimate the ancestral G+C content.

Galtier and Gouy's model is particularly variable rich: in addition to common parameters such as branch lengths and transition/transversion ratio, it incorporates different equilibrium G+C contents for each branch and an additional G+C content at the root. This makes it a model containing 4n–2 variables, with n the number of taxa in the tree: 2n–3 branch lengths, 2n–2 equilibrium G+C contents, the G+C content at the root, the transition/transversion ratio, and an additional parameter defining the position of the root on its branch; i.e., the fraction of the branch length lying on the left side of the root. All these parameters are estimated in the maximum likelihood framework, which leads to a computationally intensive model. For this reason, in all the studies that used this model to find phylogenetic trees (Galtier et al., 1999; Tarrio et al., 2001; Herbeck et al., 2005), no exploration of the space of tree topologies was conducted; the model was simply used to compare a limited set of input phylogenies.

The use of very efficient heuristics to find most likely trees is then mandatory for this kind of model to be used not just as an evaluation tool. PhyML (Guindon and Gascuel, 2003) is such an algorithm that explores the space of tree topologies around an input phylogeny. The adaptation of Galtier and Gouy's model to the algorithmic structure of PhyML permits for the first time to use this nonstationary, nonhomogeneous model to explore the space of phylogenies with dozens of sequences.

PhyML's code was deeply modified to produce nhPhyML. nhPhyML starts from a user input–rooted topology: the choice of the root depends upon the user and is unchanged throughout the whole search for the most likely tree, except for its position on its branch. Even if we have shown that NNIs around the root branch could be easily implemented, this has not been done in nhPhyML. Data structures had to be slightly remodeled, as in Galtier and Gouy's model each branch has its own substitution matrix, and algorithms had to incorporate the fact that the root was fixed, both in the computation of likelihood values for alternate topologies obtained by NNIs (2) and in the computation of conditional likelihoods themselves. Those conditional likelihoods are computed almost as in PhyML, by first a postorder (the original Felsenstein's pruning algorithm; Felsenstein, 1981) and then a preorder tree traversal, but starting from the root of the tree, whereas in the reversible case the starting point could be any leaf.

Equations (2) and (3) show that lower conditional likelihoods depend, if at a leaf, upon the base observed in the sequence, or, if at an internal node, upon the values of the underlying nodes lower conditional likelihoods. Lower conditional likelihoods can then be obtained by a postorder tree traversal starting from the root node: the tree is traversed to its leaves, and then the conditional likelihoods of the upper nodes are computed, from the leaves up to the root. On the contrary, upper conditional likelihoods depend both on underlying nodes' lower conditional likelihoods and on above-lying nodes' upper conditional likelihoods (Equations (6) and (7)): upper conditional likelihoods can then be computed once lower conditional likelihoods have been computed, and with a preorder tree traversal.

All the parameters of the model are optimized with the Newton-Raphson method (Felsenstein and Churchill, 1996; Galtier and Gouy, 1998). Derivatives are computed analytically except for the shape parameter of the gamma distribution accounting for differences in substitution rates across sites (Yang, 1993) whose derivatives are computed numerically.

The topology is reorganized as in PhyML except that when estimating the approximate likelihood of a given NNI, not only the length but also the equilibrium G+C content of the internal branch around which NNIs are done are optimized.

Results obtained with nhPhyML suggested that the program was prone to getting trapped in local maxima. This prompted us to develop an approximate version of the Galtier and Gouy model, named nhPhyMLDiscrete.

We thus adopted a strategy inspired by Foster (2004) and only allowed a limited set of c equilibrium frequencies, themselves permanently set to Formula , ..., Formula . Each branch can still have its own equilibrium frequency, by choosing from the few ones available.

Three changes were introduced in the algorithm. First, the user sets the number c of equilibrium G+C frequencies. Second, before the exploration of the space of tree topologies, each equilibrium frequency is tested for each branch independently from the others, and the branch length is optimized for each equilibrium frequency. The best pairs (equilibrium frequency–branch length) are recorded and ordered according to the gain in likelihood they permit. When all branches have been tried, all the best values are simultaneously used. If the likelihood does not increase, only the first half of them, according to the order previously defined, are applied, until increase. This technique is very similar to the one used in PhyML to optimize branch lengths. Third, the space of tree topologies is explored as in PhyML, except that for each NNI that is tried, all the equilibrium frequencies are tried on the internal branch, and for each one the branch length is optimized.

Ability of nhPhyML to Explore the Space of Tree Topologies
In order to estimate the ability of nhPhyML to explore the tree topological space, we simulated the evolution of 1000-bp-long sequences according to Galtier and Gouy's model with a gamma-distributed rate across sites and the rooted version of the trees containing 40 leaves that were used to test PhyML (Guindon and Gascuel, 2003). The ancestral sequence G+C content was uniformly drawn from the interval [0.2; 0.8], and the equilibrium G+C frequencies were uniformly drawn at each node from the interval [0.1; 0.8], but the transition-transversion ratio was kept constant on the whole tree. We then applied various algorithms (neighbor joining, maximum parsimony and maximum likelihood with PhyML) to estimate their ability to find the topologies that had been used to simulate the evolution of the sequences (the "true topologies") and compared them to nhPhyML.

Among these algorithms, we distinguished programs that do not need a starting topology (like distance-based approaches) from the ones that reorganize a user input tree to explore the space of tree topologies (like nhPhyML). PhyML and the parsimony algorithm can be said to belong to the two classes, as they reorganize a starting topology that can be input a priori by the user or generated by the program itself. Two experiments were then conducted, one in which algorithms that do not need a user input topology were tested upon the simulated sequences (Fig. 3, white bars), and one in which algorithms that can run starting from a user input topology were compared (Fig. 3, grey bars). For this second experiment, input topologies were obtained by perturbing the "true topologies" by a number of NNIs uniformly drawn from [5;20], while making sure that the ingroup and the outgroup were not melted. This additional condition is necessary as nhPhyML is not able to question the position of the root when exploring the space of tree topologies.


Figure 3
View larger version (101K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 Efficiency of various methods in reconstructing a phylogeny in nonhomogeneous conditions. White bars: Results obtained by phylogenetic methods that do not start from a user input topology. Grey bars: Results obtained by phylogenetic methods that start from perturbed input topologies. These results were obtained with the same 2000 different topologies taken from the PhyML test set. Error bars represent standard deviations. Asterisks are displayed where Student paired t-tests are significant at the 1% level.

 
To estimate the efficiency of a method, Robinson and Foulds' (R&F; Robinson and Foulds, 1979) average distances between the true topologies and the reconstructed ones were computed with the PHYLIP package (Felsenstein, 1989). Results are given in Figure 3.

For both PhyML (version 2.4.4, under the TN93 model [Tamura and Nei, 1993]) and nhPhyML, reconstruction was made using a gamma law with eight categories to account for across-site rate variation; parameter {alpha}, transition/transversion ratio, and the other parameters were estimated by the programs. PhyML and the parsimony method as implemented in PAUP* (Swofford, 2003) were both used from their built-in starting topology and from the perturbed input topologies. The neighbor-joining algorithm was applied to pairwise distances estimated under HKY85 (Hasegawa et al., 1985), LogDet (Lake, 1994; Lockhart et al., 1994), GG95 (Galtier and Gouy, 1995), and transversions-only observed divergence distances.

Figure 3 shows that PhyML is more efficient at finding good topologies than parsimony, which also has better results than distance methods. It is surprising to note that the transversions-only observed divergence and the GG95 distances perform worse than the HKY85 distance, because these methods were devised to be resistant to G+C content biases. However, as expected, the LogDet distance provides better results than the HKY85 distance.

Figure 3 (grey bars) compares the efficiencies of parsimony, PhyML, and nhPhyML methods. The average distance between the rearranged input topologies and the true topologies is shown. All the methods tested are able to explore the space of tree topologies to find better trees than the input ones. Results obtained by PhyML from the rearranged input phylogenies are better than when PhyML departs from its own starting topology (a distance-based tree). As the rearranged topologies are further away from the true topologies than the distance-based trees, this comes from the fact that distance trees fail on subtrees also difficult to solve for the maximum likelihood method, whereas the rearranged topologies can be perturbed at the level of subtrees whose solution is trivial.

It appears that PhyML is able to find better trees than parsimony and surprisingly also better trees than nhPhyML (Student unilateral paired t-test, P-value < 1 x 10–10). This might be due to the large number of parameters: nhPhyML has 2x n–2 additional parameters when compared to PhyML, because an equilibrium G+C content is associated to each branch. This might result in a likelihood surface with lots of local maxima, in which the algorithm would get trapped.

The comparison of the likelihood values found by nhPhyML when launched from the rearranged topologies to the likelihood values computed on the true topologies comforts us in this hypothesis: the log-likelihood of the true trees is on average 62.9 points higher than the log-likelihood of the trees found by nhPhyML, which have a better likelihood than the true trees in only 830 cases out of 2000. As a comparison, PhyML finds topologies with a 1.2-point higher log-likelihood score than the true topologies and finds topologies with better likelihoods than the true ones in 1562 cases out of 2000. This means that nhPhyML fails to correctly explore the space of tree topologies, because it gets trapped in local maxima and does not get to the real maximum. Being particularly parameter rich, it seems that nhPhyML can fit nearly any topology by taking advantage of its numerous parameters.

An approximate and less flexible model was developed and was named nhPhyML-Discrete. This nonhomogeneous model still has the same number of free parameters as nhPhyML, as each branch can have a particular equilibrium frequency, but is also much less "flexible," because these equilibrium frequencies are constrained to be in the limited set defined by the user. These constraints have a positive impact on the results of the algorithm. These are shown in Figure 3 for sets of 1 to 4 equilibrium frequencies.

nhPhyML-Discrete shows a better topological accuracy than PhyML even when using only one equilibrium frequency: this can be due either to the fact that the root base distribution is estimated in nhPhyML-Discrete and not in PhyML, or to the fact that the ingroup and the outgroup cannot be swapped in nhPhyML-Discrete, thereby avoiding the exploration of unreasonable tree topologies contrary to PhyML. Increasing the number of equilibrium frequencies further increases nhPhyML-Discrete's topological accuracy, but this tendency quickly reverses, as using 3 equilibrium frequencies yields better results than 4 equilibrium frequencies (though the unilateral paired Student t-test is not significant). When further increasing the number of equilibrium frequencies, the topological accuracy continues dropping, with, for instance, an average distance to the true topologies of 2.23 for 10 equilibrium frequencies (data not shown), not better than when using only 1 equilibrium frequency (2.18, Student unilateral paired t-test P-value: 0.054). Overall, it seems that using 3 equilibrium frequencies might be a good choice, as it gets the best topological accuracy on the simulations, which is the same performance as with 2 equilibrium frequencies, but with a lower standard deviation.

When used with 3 G+C equilibrium frequencies, the algorithm finds topologies closer to the true ones than PhyML (nhPhyML-Discrete: 2.09, PhyML: 2.49, Student unilateral paired t-test, P-value < 1 x 10–10). This also has an impact on the risk of getting trapped in local optima: nhPhyML-Discrete finds topologies that have a log-likelihood on average 37.5 points higher than the true topology log-likelihoods, which is better than nhPhyML. Moreover, it appears that nhPhyML-Discrete finds trees with better likelihoods than the true tree in 1768 cases out of 2000. Overall, it seems that nhPhyML-Discrete shows a performance as good as PhyML's one, with a stronger variation in log-likelihood, which might hint for a stronger discriminating power. Finally, this approximation has a great impact on the computational speed, nhPhyML used on average 40 min 42 s to give its results while nhPhyML-Discrete only needed 20 min 43 s: it is faster to choose among a limited set of equilibrium frequencies than to optimize the value of a continuous parameter.

Figure 4 shows that nhPhyML-Discrete is, as PhyML and parsimony, nearly insensitive to the distance of its input phylogeny to the true one. On the contrary, nhPhyML does not seem to be able to cope with distant topologies, which is in agreement with the results above.


Figure 4
View larger version (72K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 Ability of phylogenetic methods to explore the space of tree topologies in nonhomogeneous conditions. As the distance from the input phylogenies to the true topologies increases, results obtained by nhPhyML get worse, but nhPhyML-Discrete seems to be less dependent upon the input topologies.

 
Estimation of Root G+C Content
The ancestral G+C content is a parameter of the model in itself. We conducted tests to check the ability of the program to estimate this parameter on trees containing 40 leaves, either from the true phylogenies or from the phylogenies found by nhPhyML and nhPhyML-Discrete when it was input perturbed topologies. nhPhyML-Discrete results are shown Figure 5 for 3 equilibrium frequencies.


Figure 5
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5 Ability of nhPhyML-Discrete to estimate ancestral G+C contents. nhPhyML-Discrete was used to estimate ancestral G+C contents on sequences simulated on 2000 different topologies from the PhyML test set (see text) from the phylogenies found by the program itself, with 3 equilibrium frequency categories. In each experiment, parameter {alpha} that accounts for across-site rate variation using an 8-category discretized gamma distribution, transition/transversion ratio, and the other parameters were estimated by maximizing the likelihood.

 
Whether it is estimated from the true phylogenies, or from the phylogenies found by nhPhyML or nhPhyML-Discrete (Fig. 5), the ancestral G+C content is well estimated. The correlation coefficient between estimated and expected G+C contents is above 0.99 in all cases. Interestingly, results do not depend upon the number of equilibrium frequencies: performances are highly similar whether we use only 1 equilibrium frequency or whether we use nhPhyML. The average of the squared differences between the estimated and the true values is {approx}0.000282 when inferred from the true phylogeny, {approx}0.000292 when inferred from the phylogenies found by nhPhyML, and {approx}0.0003423 when inferred by nhPhyML-Discrete with 3 equilibrium frequencies from the phylogenies it has found. It is interesting to note that even if nhPhyML-Discrete does not model evolution as precisely as nhPhyML, being limited in its choice of equilibrium frequencies, it can still provide very good estimates of the ancestral G+C contents, even from topologies that are not the true ones.

Overall it appears that the limitation of the number and values of equilibrium frequencies has been very successful, permitting to increase the ability of the algorithm to explore the space of tree topologies while retaining the capacity to estimate ancestral G+C content.


    Estimation of the Universal Phylogeny and of the Ancestral G+C Content
 Top
 Abstract
 Computing the Likelihood of...
 nhPhyML, Adaptation of PhyML...
 Estimation of the Universal...
 Conclusion
 References
 
The ability of nhPhyML-Discrete to explore the space of tree topologies and to estimate the ancestral G+C content has then been demonstrated and can be applied to real data.

As rRNA stem G+C contents are known to be correlated to the optimal growth temperatures in Bacteria and Archaea (Galtier and Lobry, 1997), these genes are especially good candidates for an analysis using Galtier and Gouy's model (Galtier and Gouy, 1998), and hence nhPhyML-Discrete. Therefore, we reiterated the analysis of Galtier et al. (1999), improving the taxonomic sampling and benefiting from the heuristics of PhyML to better scan the space of tree topologies.

The analysis was divided in two steps: first we searched with nhPhyML-Discrete for the most likely topology for which complete rRNA genes and the model plead, and then, using the best topology we could find and the stem portion of rRNA sequences, we estimated the G+C content of the last universal common ancestor (LUCA) with nhPhyML.

Phylogeny Estimation
Small and large rRNA subunit sequences were downloaded from the European Ribosomal RNA database (Wuyts et al., 2004) and from generalist databases for a few missing sequences. Ninety-two species were selected, representing the three domains of life with 22 Archaea, 34 Bacteria, and 36 Eukaryota. Small- and large-subunit rRNA genes were concatenated, aligned using ClustalW (Chenna et al., 2003), and manually curated. The resulting alignment contained 2924 sites, with G+C contents ranging from 43% to 71%, which highlights the need for models robust to compositional biases.

Even if the ability of nhPhyML-Discrete to explore the space of tree topologies appears as good as PhyML's, it is wise to run the program from various starting topologies to diminish the risk of getting trapped in local optima. Moreover, as the process of evolution in not reversible, the position of the root influences the likelihood value. For this reason, it was decided to build various topologies with distance-based and parsimony methods (as implemented in Phylo_Win, Galtier et al., 1996), and then to try three different rootings for each of these topologies. The trees were rooted either on the branch leading to the Archaea, on the branch leading to the Bacteria, or on the branch leading to the Eukaryota. This produced 57 starting phylogenies, among which 9 had identical topologies.

Results obtained by nhPhyML-Discrete on these trees were then analyzed using CONSEL (Shimodaira and Hasegawa, 2001) and Treedist (PHYLIP package; Felsenstein, 1989). Eighteen phylogenies were found to be significantly more likely than the others (AU test P-values < 5%) among the 55 different topologies found, which shows that even if nhPhyML-Discrete showed good capabilities to explore the space of tree topologies, on real cases, with many sequences, the algorithm can get trapped in a local maximum, even when starting from two phylogenies with the same topologies but different branch lengths. None of these 18 topologies were identical, so the majority rule (extended) consensus tree obtained from these trees was built (Fig. 6) using Consense from the PHYLIP package (Felsenstein, 1989).


Figure 6
View larger version (198K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 6 Consensus tree obtained from the 18 significantly most likely trees obtained by nhPhyML-Discrete. The tree was built as described in the text. Dashed lines represent branches that were not found in all the 18 most likely trees. The alignment contained 2924 sites. Group names are in agreement with the NCBI taxonomy. Sequence G+C contents range from 43% to 71%, which highlights the need to use models robust to compositional biases.

 
The Tree of Life
Though we do not believe that a two-gene phylogeny can clarify the Tree of Life, we think the analysis of rRNAs using a nonhomogeneous model might bear some interesting insights.

Most great clades are found monophyletic; e.g., Proteobacteria, Metazoa, Crenarchaeota. In the Bacteria kingdom, Firmicutes and Clostridia are found associated in all trees. Planctomycetes appear monophyletic and are associated to Chlamydiales but do not get a basal position as in Brochier and Philippe (2002): this result was found by getting rid of fastest evolving positions on the small subunit rRNA gene and is not found when coping with compositional heterogeneities on LSU and SSU rRNA genes. Instead, hyperthermophilic Bacteria are found at the root of the bacterial clade, as in first rRNA phylogenies (Woese, 1987).

Ancestral G+C Content Estimate
The consensus universal phylogeny found using nhPhyML-Discrete was used to infer the ancestral G+C contents of the small- and large-subunit rRNA genes stem regions. The inference was performed using only slowly evolving stem regions for two reasons. First, only rRNA stems, that is, the fraction of the rRNA molecule that folds as double helices, have a G+C content strongly correlated with optimal growth temperature (Galtier and Lobry, 1997). Second, Gowri-Shankar and Rattray (2006) have recently shown that the ancestral G+C content estimate obtained with the Galtier and Gouy model was biased towards the G+C content at slowly evolving sites and that equilibrium frequencies were biased towards fast-evolving sites. Correlations between interacting sites of the helices were not taken into consideration.

Stem regions were identified using the following procedure. The rRNA alignment downloaded from the European Ribosomal RNA database (Wuyts et al., 2004) was extended to the present sample of 92 sequences by manually aligning missing sequences using Seaview Galtier et al., (1996). A total of 1896 sites were predicted in stems in Escherichia coli, Archaeoglobus fulgidus, and Schizosaccharomyces pombe. Slowly evolving sites were identified by using the COE program (Dutheil et al., 2005) from the Bio++ package (Dutheil et al., 2006) and selecting sites predicted to undergo on average less than 0.1 substitution per branch under a HKY85 model with an 8-class discretized gamma law to model rate heterogeneity. Six hundred seventy-eight slowly evolving stem sites were finally retained.

The correlation between rRNA stem slowly evolving sites G+C content and optimal growth temperature (Topt) is high for both Bacteria (0.815) and Archaea (0.953), which is in agreement with Galtier and Lobry (1997). Because ancestral G+C content inferences are known to be robust with respect to the tree topology (Galtier et al., 1999; and Fig. 5 herein), those estimates are expected not to depend strongly on the input phylogeny.

Because the location of the root of the universal tree is not currently known (Brown and Doolittle, 1997; Forterre and Philippe, 1999), the likelihoods of all three possible rootings, that is, on the branch leading to each one of the three domains, were computed using nhPhyML on the slowly evolving stem sites. We chose not to use the discrete version of nhPhyML in order to avoid any bias that might arise from the fact that equilibrium frequencies are set to a priori values. For instance, with three G+C categories, the equilibrium frequencies are set to 0.25, 0.50, or 0.75, which may not model the evolution of the slowly evolving stem sites appropriately, given that their G+C content range from 0.52 for Entamoeba histolytica to 0.83 for Methanopyrus kandleri. Four rate categories were used to model rate heterogeneity, and the parameter {alpha} of the gamma distribution and the transition/transversion ratio were optimized by maximizing the likelihood. Since studies (Huelsenbeck et al., 2002; Yap and Speed, 2005) have shown that there may be information in extant sequences for the identification of the root position of a phylogenetic tree using nonreversible models of evolution, we compared all three likelihoods to see which root the model and the data were predicting. We used the slowly evolving stem sites, as they are expected to be less prone to saturation problems. The highest likelihood was obtained by the bacterial branch rooting (log-likelihood: –15,139.94, ancestral G+C content: 71.8%), whereas the least realistic rooting was found on the archaeal branch (log-likelihood: –15,147.22, ancestral G+C content: 75.1%) and could be rejected using CONSEL (Shimodaira and Hasegawa, 2001) (AU test P-value < 5%). Hence, we chose to discard the archaeal rooting from subsequent analyses. There was no significant difference between the bacterial and the eukaryotic rootings (log-likelihood: –15,141.61, ancestral G+C content: 69.2%). It seems interesting to note that the longest branch (the eukaryotic branch) was not found to provide the most likely rooting: the model is then able to find a signal that is independent from the evolutionary distance.

To estimate the accuracy of the estimation of ancestral G+C contents, we computed the likelihoods obtained when setting the ancestral G+C content to various values, between 55% and 85%, and compared the results using CONSEL. Ancestral G+C contents with AU test P-values higher than 5% were considered to be in the confidence interval. We found that, when rooting in the eukaryotic branch, ancestral G+C contents ranging from 63% to 75% could not be rejected, whereas when rooting in the bacterial branch, the confidence interval was [67%; 76%]. The larger confidence interval found for the eukaryotic rooting might be explained by the fact that this branch is considerably longer than the bacterial one: the extra length of the eukaryotic branch may provide more latitude to accomodate nonoptimal ancestral G+C contents.

Inferred ancestral G+C contents (Fig. 7) suggest a mesophilic (optimal growth temperature below 60°C) to thermophilic (optimal growth temperature between 60°C and 80°C) LUCA, in agreement with Galtier et al. (1999). Interestingly, both confidence intervals do not contain any value that seem to favour a hyperthermophilic LUCA. As a consequence, it appears that reducing site rate heterogeneity to avoid the bias put forth by Gowri-Shankar and Rattray (2006) does not contradict Galtier et al.'s conclusion.


Figure 7
View larger version (57K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 7 Correlation between rRNA stem G+C contents and optimal growth temperature, and ancestral G+C content estimates. Circles: bacterial data; triangles: archaeal data. Estimates of the G+C content of the stem fraction slowly evolving sites of rRNA molecules at the tree root are indicated by vertical lines, continuous for the bacterial branch rooting and dashed for the eukaryotic branch rooting. Confidence intervals are represented by horizontal segments, continuous for the bacterial rooting, and dashed for the eukaryotic rooting. The linear regression is represented as a dotted line. An orthogonal regression provided very similar results.

 

    Conclusion
 Top
 Abstract
 Computing the Likelihood of...
 nhPhyML, Adaptation of PhyML...
 Estimation of the Universal...
 Conclusion
 References
 
In this article, we have shown that by reorganizing the way likelihood is computed, one can efficiently explore the space of tree topologies with a nonreversible model of evolution. We modified the PhyML algorithm (Guindon and Gascuel, 2003) to cope with the nonhomogeneous, nonstationary model of Galtier and Gouy (1998) and tested its abilities to find the right topology and to estimate the G+C content at the root. An approximate model was also tested, which showed good performance in both tree topological space exploration and ancestral G+C content estimation. We eventually used the program to estimate the topology of the Tree of Life from rRNA sequences and to estimate the ancestral stem G+C content by only selecting slowly evolving sites. The results agree with the ones obtained by Galtier and Lobry (1999) and support a nonhyperthermophilic last universal common ancestor.

Genome sequences vary widely in their composition between species. Therefore, when building a phylogenetic tree from such heterogeneous data, it is important to use a method robust to compositional biases. Nonhomogeneous models of evolution are particularly suitable, but their nonreversibility discarded them from most general phylogeny packages and prevented their use in large scale analyses. This work renders nonreversible models of evolution useful for phylogeny reconstruction, which considerably broadens the range of available models and opens new opportunities for models explicitly dealing with compositional biases.


    Acknowledgements
 
nhPhyML is available as a LINUX executable file at http://pbil.univ-lyon1.fr/software/nhphyml/ and as source code upon request. We want to thank the reviewers and Olivier Gascuel for their constructive remarks, which resulted in considerable improvement in the quality of the manuscript. This work was supported by Action Concertée Incitative IMPBIO. We thank the Centre de Calcul de l'IN2P3 for providing computer resources. Bastien Boussau acknowledges a PhD scholarship from the Centre National de la Recherche Scientifique. We also thank Mathilde Paris for fruitful discussions.


    References
 Top
 Abstract
 Computing the Likelihood of...
 nhPhyML, Adaptation of PhyML...
 Estimation of the Universal...
 Conclusion
 References
 

    Brochier C., Philippe H. Phylogeny: A non-hyperthermophilic ancestor for bacteria. Nature (2002) 417:244–244.[CrossRef][Medline]

    Brown J. R., Doolittle W. F. Archaea and the prokaryote-to-eukaryote transition. Microbiol Mol. Biol. Rev. (1997) 61:456–502.[Abstract/Free Full Text]

    Chenna R., Sugawara H., Koike T., Lopez R., Gibson T. J., Higgins D. G., Thompson J. D. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. (2003) 31:3497–3500.[Abstract/Free Full Text]

    Dutheil J., Pupko T., Jean-Marie A., Galtier N. A model-based approach for detecting coevolving positions in a molecule. Mol. Biol. Evol. (2005) 22:1919–1928.[Abstract/Free Full Text]

    Dutheil U., Gaillard U., Bazin U., Glemin U., Ranwez U., Galtier U., Belkhir U. Bio++: A set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics (2006) 7:188–188.[CrossRef][Medline]

    Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Felsenstein J. Phylogeny inference package (version 3.2). Cladistics (1989) 5:164–166.

    Felsenstein J., Churchill G. A. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. (1996) 13:93–104.[Abstract]

    Forterre P., Philippe H. Where is the root of the universal tree of life? Bioessays (1999) 21:871–879.[CrossRef][Web of Science][Medline]

    Foster P. G. Modeling compositional heterogeneity. Syst. Biol. (2004) 53:485–495.[Abstract/Free Full Text]

    Galtier N., Gouy M. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA (1995) 92:11317–11321.[Abstract/Free Full Text]

    Galtier N., Gouy M. Inferring pattern and process: Maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. (1998) 15:871–879.[Abstract]

    Galtier N., Gouy M., Gautier C. SEAVIEW and PHYLO_WIN: Two graphic tools for sequence alignment and molecular phylogeny. Cabios (1996) 12:543–548.[Medline]

    Galtier N., Lobry J. R. Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol. (1997) 44:632–636.[CrossRef][Web of Science][Medline]

    Galtier N., Tourasse N., Gouy M. A nonhyperthermophilic common ancestor to extant life forms. Science (1999) 283:220–221.[Abstract/Free Full Text]

    Gowri-Shankar V., Rattray M. On the correlation between composition and site-specific evolutionary rate: Implications for phylogenetic inference. Mol. Biol. Evol. (2006) 23:352–364.[Abstract/Free Full Text]

    Guindon S., Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. (2003) 52:696–704.[Abstract/Free Full Text]

    Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.[CrossRef][Web of Science][Medline]

    Herbeck J. T., Degnan P. H., Wernegreen J. J. Nonhomogeneous model of sequence evolution indicates independent origins of primary endosymbionts within the enterobacteriales (gamma-Proteobacteria). Mol. Biol. Evol. (2005) 22:520–532.[Abstract/Free Full Text]

    Huelsenbeck J. P., Bollback J. P., Levine A. M. Inferring the root of a phylogenetic tree. Syst. Biol. (2002) 51:32–43.[Abstract/Free Full Text]

    Lake J. A. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proc. Natl. Acad. Sci. USA (1994) 91:1455–1459.[Abstract/Free Full Text]

    Lockhart P. J., Steel M. A., Hendy M., Penny D. Recovering evolutionary trees under a more realistic model of sequence. Mol. Biol. Evol. (1994) 11:605–612.[Web of Science][Medline]

    Robinson D., Foulds L. Comparison of weighted labeled trees. In: Isomorphic factorisations VI: Automorphisms, combinatorial mathematics—Horadam A. F., Wallis W. D., eds. Springer. 119–126. 748 in Lecture Notes in Mathematics.

    Shimodaira H., Hasegawa M. CONSEL: For assessing the confidence of phylogenetic tree selection. Bioinformatics (2001) 17:1246–1247.[Abstract/Free Full Text]

    Stamatakis A., Ludwig T., Meier H. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics (2005) 21:456–463.[Abstract/Free Full Text]

    Swofford D. L. PAUP* 2003. Phylogenetic analysis using parsimony (*and other methods), version 4.Sinauer Associates (2003) Sunderland: Massachusetts.

    Tamura K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. (1992) 9:678–687.[Abstract]

    Tamura K., Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. (1993) 10:512–526.[Abstract]

    Tarrio R., Rodriguez-Trelles F., Ayala F. J. Shared nucleotide composition biases among species and their impact on phylogenetic reconstructions of the Drosophilidae. Mol. Biol. Evol. (2001) 18:1464–1473.[Abstract/Free Full Text]

    Woese C. R. Bacterial evolution. Microbiol Rev. (1987) 51:221–271.[Free Full Text]

    Wuyts J., Perriére G., De Peer Van Y. The European Ribosomal RNA database. Nucleic Acids Res (2004) 32:D101–D103.[Abstract/Free Full Text]

    Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol (1993) 10:1396–1401.[Abstract]

    Yang Z. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. (1998) 15:568–573.[Abstract]

    Yap V. B., Speed T. Rooting a phylogenetic tree with nonreversible substitution models. BMC Evol. Biol. (2005) 5(1):2. Jan 4.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
N. C. Sheffield, H. Song, S. L. Cameron, and M. F. Whiting
Nonstationary Evolution and Compositional Heterogeneity in Beetle Mitochondrial Phylogenomics
Syst Biol, August 1, 2009; 58(4): 381 - 394.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (13)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Boussau, B.
Right arrow Articles by Gouy, M.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Boussau, B.
Right arrow Articles by Gouy, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?