© 2006 Society of Systematic Biologists
Which Random Processes Describe the Tree of Life? A Large-Scale Study of Phylogenetic Tree Imbalance
Edited by Mike Steel: Associate Editor
1 Department of Human Genetics, University of Michigan 2017 Palmer Commons, 100 Washtenaw Avenue, Ann Arbor, Michigan, 48109–2218, USA
2 TIMB Department of Mathematical Biology, Grenoble Universités TIMC UMR 5525, Fac. Méd, La Tronche cedex, F38706, France E-mail: Olivier.Francois@imag.fr (O.F.)
Received December 21, 2005; Revised February 11, 2006; Accepted March 27, 2006
The explosion of phylogenetic studies not only provides a clear snapshot of biodiversity, but also makes it possible to infer how the diversity has arisen (see, for example, Purvis and Hector, 2000; Harvey et al., 1996; Nee et al., 1996; Mace et al., 2003). To this aim, variation in speciation and extinction rates has been investigated through its signatures in the shapes of phylogenetic trees (Mooers and Heard, 1997). This issue is of great importance, as fitting stochastic models to tree data would help to understand the underlying macroevolutionary processes. Although the prevailing view is that it represents phylogenies poorly, the most popular model of phylogenetic trees is a branching process introduced by Yule, in which lineages split at random (Yule, 1924). Here we report a study of one major database of published phylogenies using the Yule model as well as several other models. Our results confirm the previous observation that the Yule model is inadequate to describe phylogenetic tree data. In addition, they support the hypothesis that many trees are consistent with a simple branch split model first considered by Aldous in 1996.
The analysis of stochastic models of phylogenetic tree shape, which began with Yule in 1924, was revived in the mid-1970s by the Woods Hole Group (Raup et al., 1973; Gould et al., 1977). Their model-based approach yielded the conclusion that lineages have varied in their potential for diversification. As a result of their work, much effort has recently been placed on understanding stochastic models of tree shape and their relationship to phylogenetic data. Among these models, the Equal Rate Markov (ERM, Yule) model is one of the simplest and most-often postulated as a null hypothesis for phylogenetic tree shape. In the ERM model each branch has an equal probability of splitting. Among others, Moran (1958) and Hey (1992) have also studied processes that share the same probability distribution of topologies as the Yule model. A second model—Proportional to Distinguishable Arrangements (PDA)—has also received a lot of attention. The PDA model has the property that all tree topologies are equally likely. Although less direct than for the ERM model, several interpretations of the PDA model in terms of evolutionary processes have been given in the past. For instance, Aldous (1991) found a correspondence between the PDA model and the genealogy of n species sampled from a critical branching process (i.e., including an extinction rate). McKenzie and Steel (2001) additionally established that explosive radiation processes can lead to the PDA model. More recently, Pinelis (2003) proved that multitype branching processes with species quasi-stabilization can also yield PDA-like trees.
One aspect of tree shape is particularly important when testing stochastic macroevolutionary models: tree balance. Tree balance usually refers to the topological structure of the tree, not considering the branch lengths. Early studies (Guyer and Slowinski, 1991; Heard, 1992; Guyer and Slowinski, 1993) agreed that reconstructed phylogenies were more imbalanced than predicted by the ERM model. However, these studies were based on small samples of trees, each of rather small size. The need for new large-scale studies of phylogenetic tree imbalance has therefore been emphasized several times (e.g., Aldous, 2001).
The ERM and the PDA models may both be viewed as branching Markov processes. Branching Markov processes are discrete recursive structures (cladograms) specified through symmetric split distributions, i.e., conditional probability distributions p(i|n) for the left sister clade size i given the parent clade size n. As shown by Harding (1971), the ERM model has the uniform split distribution throughout the tree. In other words, the probability that left sister clades contain i taxa is independent of i, and is equal to p(i|n) = 1/(n–1) for all internal nodes. For the PDA model, the split distribution is given by
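[Equations (1)–(4): display formulas not preserved in this copy. Equation (1) is the PDA split distribution; Equation (4) is the beta-splitting split distribution of Aldous (1996), with parameter β, of which the AB model is the case β = –1.]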
where Γ(z) is the Gamma function (see Abramowitz and Stegun, 1970) and a_n(β) is a normalizing factor (Aldous, 1996). As an intermediate model, tree shape under the AB model differs significantly from that under the ERM and PDA models. For instance, at β = –1 the mean depth d_n of a randomly chosen taxon in an n-species tree undergoes a "phase" transition. The order of d_n is log n for β > –1 (in particular in the ERM model). As β decreases, it undergoes a sudden change at β = –1 (AB model), where it jumps to log² n, and another change for –2 < β < –1, where it jumps again to n^(–β–1). Aldous (2001) also noticed that the β = –1 model produces a better fit to some data sets than does either the Markov (ERM) or the PDA model. An attempt to describe the biological motivation for the beta-splitting model is postponed until the end of this Point of View.
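The three families of split distributions discussed above can be compared numerically. The following is a minimal sketch, not the authors' code: it assumes the standard forms given by Aldous (1996), namely the uniform split p(i|n) = 1/(n–1) for ERM, the PDA split (1/2)·C(n,i)·c_i·c_(n–i)/c_n with c_n = (2n–3)!!, and beta-splitting weights proportional to Γ(β+i+1)Γ(β+n–i+1)/(Γ(i+1)Γ(n–i+1)). The function names are hypothetical.

```python
import math


def erm_split(n):
    """Uniform split distribution p(i|n) = 1/(n-1), i = 1..n-1 (ERM / Yule)."""
    return [1.0 / (n - 1)] * (n - 1)


def num_rooted_trees(n):
    """c_n = (2n-3)!!, the number of rooted binary trees on n labelled leaves."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n-3
        count *= k
    return count


def pda_split(n):
    """PDA split distribution (assumed standard form, see lead-in)."""
    c_n = num_rooted_trees(n)
    return [
        math.comb(n, i) * num_rooted_trees(i) * num_rooted_trees(n - i) / (2 * c_n)
        for i in range(1, n)
    ]


def beta_split(n, beta):
    """Beta-splitting distribution for beta > -2, normalized numerically."""
    if beta <= -2:
        raise ValueError("beta must be greater than -2")
    log_w = [
        math.lgamma(beta + i + 1) + math.lgamma(beta + n - i + 1)
        - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        for i in range(1, n)
    ]
    top = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(x - top) for x in log_w]
    total = sum(weights)  # plays the role of the normalizing factor a_n(beta)
    return [w / total for w in weights]
```

Under these assumed forms, `beta_split(n, 0)` coincides with the ERM split, `beta_split(n, -1.5)` with the PDA split, and `beta_split(n, -1)` is the AB model, with weights proportional to 1/(i(n–i)).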
Several measures of tree balance have been proposed in the literature (e.g., Colless, 1982; Agapow and Purvis, 2002; Felsenstein, 2003, chapter 33). For an n-species tree, we consider the following shape statistic
[Equation (5): definition of the shape statistic s, not preserved in this copy.]
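As an illustration of how a shape statistic of this kind can be computed, the sketch below assumes that s is the sum of log(N_i – 1) over the internal nodes of the tree, where N_i is the number of taxa descending from node i; this form is the log-likelihood ratio between the ERM and PDA models up to a constant, and should be read as an assumption rather than as the formula printed in the article. Trees are represented as nested 2-tuples with strings as leaves, a hypothetical encoding chosen for brevity.

```python
import math


def _count_and_statistic(tree):
    """Return (number of leaves, sum of log(N_i - 1)) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return 1, 0.0  # a leaf contributes no internal node
    left, right = tree
    n_left, s_left = _count_and_statistic(left)
    n_right, s_right = _count_and_statistic(right)
    n = n_left + n_right
    return n, s_left + s_right + math.log(n - 1)


def shape_statistic(tree):
    """Assumed shape statistic: sum over internal nodes of log(N_i - 1)."""
    return _count_and_statistic(tree)[1]


# A fully imbalanced (comb) tree and a balanced tree on four taxa.
comb = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
```

With this convention the comb tree has clade sizes 4, 3 and 2, so its statistic is log 3 + log 2 + log 1, whereas the balanced tree has clade sizes 4, 2 and 2 and a statistic of log 3. Larger values therefore indicate more imbalance.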
Here we report a study of phylogenetic imbalance based on one major database, TreeBASE (http://www.treebase.org), which serves as a searchable, archival repository of data and scientific references (Sanderson et al., 1994). In August 2005, TreeBASE contained 2063 phylogenetic trees with sizes ranging from 3 to 536, and 667 fully resolved binary trees with sizes ranging from 3 to 297. Many of these trees were rooted using one or two outgroup species. Because the outgroup taxa might contribute to an excess of imbalance (Heard, 1992), data were preprocessed by using an automatic outgroup removal procedure. This was done by identifying all trees in which one of the subtrees descended directly from the root had one or two taxa and removing that subtree. After the analysis, we checked on 50 fully resolved binary trees that the automatic method produced results similar to true outgroup removal (see Supplementary Material, http://systematicbiology.org). Two simulation methods were applied in order to solve polytomies: each unresolved node was replaced with a subtree of binary splits simulated according to either the ERM or the PDA split distribution. Preprocessing and data analysis were performed using the "ape" and "apTreeshape" R packages (Paradis et al., 2004; Bortolussi et al., 2006).
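The automatic outgroup removal rule described above can be sketched as follows. This is an illustrative Python version on nested 2-tuples, not the R code used in the study; the function names, the tree encoding, and the choice to leave a tree unchanged when both root subtrees are small are assumptions.

```python
def count_leaves(tree):
    """Number of taxa in a nested-tuple tree (leaves are non-tuple labels)."""
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[0]) + count_leaves(tree[1])


def remove_outgroup(tree, max_outgroup_size=2):
    """Drop a root subtree of at most `max_outgroup_size` taxa, if there is one.

    Assumption: when both root subtrees are that small, or neither is,
    the tree is returned unchanged.
    """
    if not isinstance(tree, tuple):
        return tree
    left, right = tree
    n_left, n_right = count_leaves(left), count_leaves(right)
    if n_left <= max_outgroup_size < n_right:
        return right
    if n_right <= max_outgroup_size < n_left:
        return left
    return tree


# Ingroup of four taxa rooted with a two-taxon outgroup.
rooted = (("out1", "out2"), (("a", "b"), ("c", "d")))
ingroup = remove_outgroup(rooted)
```

Here `ingroup` is the four-taxon subtree; the shape analysis would then be run on it rather than on `rooted`.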
The likelihood-ratio test based on s rejected the ERM model in 48 % of fully resolved trees and in 48 % of ERM solved trees (one-sided test, P < .05; the P-values were computed using a direct Monte Carlo method with 1000 replicates). Figure 1 displays the distribution of standardized shape indices together with the N(0,1) distribution predicted asymptotically by the ERM model. A deeper look at the data shows that tree shape undergoes a rapid change from the smaller to the intermediate-sized and larger trees. To investigate this transition, we performed maximum likelihood parameter estimation under the beta-splitting model for three sets of trees: fully resolved, ERM solved, and PDA solved (see Supplementary Material). Figure 2 displays the estimated values of β for the first two data sets (the third data set is not shown but is consistent with the others). A local regression (Cleveland et al., 1992) was performed on the estimated values. Figure 2 shows a rapid convergence around β ≈ –0.95, very close to the AB model value β = –1.
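A direct Monte Carlo test of the ERM model can be sketched as below. The sketch assumes the shape statistic is the sum of log(N_i – 1) over internal nodes (so that large values indicate imbalance), simulates that statistic under the uniform split distribution, and uses an add-one Monte Carlo P-value; the function names and the add-one convention are illustrative choices, not taken from the article.

```python
import math
import random


def simulate_erm_statistic(n, rng):
    """Draw the statistic sum(log(N_i - 1)) for one ERM tree on n taxa.

    Only clade sizes matter, so the tree is never built explicitly: each
    clade of size m >= 2 is split uniformly into sizes i and m - i.
    """
    total = 0.0
    pending = [n]
    while pending:
        m = pending.pop()
        if m < 2:
            continue
        total += math.log(m - 1)
        i = rng.randint(1, m - 1)  # uniform on {1, ..., m-1}
        pending.append(i)
        pending.append(m - i)
    return total


def erm_p_value(s_observed, n, replicates=1000, seed=0):
    """One-sided Monte Carlo P-value for excess imbalance under ERM."""
    rng = random.Random(seed)
    exceed = sum(
        1 for _ in range(replicates)
        if simulate_erm_statistic(n, rng) >= s_observed
    )
    return (exceed + 1) / (replicates + 1)


# Statistic of the comb tree on 20 taxa: log(19) + log(18) + ... + log(1).
s_comb = sum(math.log(k) for k in range(1, 20))
p_comb = erm_p_value(s_comb, 20)
```

A tree would be declared more imbalanced than expected under ERM when this P-value falls below .05; the comb tree on 20 taxa is an extreme case and is rejected.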
Table 1 reports the median and variance of the maximum likelihood estimator of β. [Table 1 is not preserved in this copy.]
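Maximum likelihood estimation of β for a single tree can be illustrated with a grid search. The sketch assumes the beta-splitting weights of Aldous (1996), proportional to Γ(β+i+1)Γ(β+n–i+1)/(Γ(i+1)Γ(n–i+1)), and treats the splits at the internal nodes as independent draws; the grid, the nested-tuple tree encoding and the function names are hypothetical, and the article's own procedure is described in its Supplementary Material.

```python
import math


def beta_split_log_prob(i, n, beta):
    """log p(i|n) under beta-splitting, normalized over i = 1..n-1."""
    def log_weight(j):
        return (math.lgamma(beta + j + 1) + math.lgamma(beta + n - j + 1)
                - math.lgamma(j + 1) - math.lgamma(n - j + 1))

    logs = [log_weight(j) for j in range(1, n)]
    top = max(logs)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logs))
    return log_weight(i) - log_norm


def collect_splits(tree):
    """Return (number of leaves, list of (left clade size, parent clade size))."""
    if not isinstance(tree, tuple):
        return 1, []
    n_left, splits_left = collect_splits(tree[0])
    n_right, splits_right = collect_splits(tree[1])
    n = n_left + n_right
    return n, splits_left + splits_right + [(n_left, n)]


def log_likelihood(tree, beta):
    """Log-likelihood of the tree's splits under beta-splitting."""
    _, splits = collect_splits(tree)
    return sum(beta_split_log_prob(i, n, beta) for i, n in splits)


def estimate_beta(tree, grid=None):
    """Grid-search maximum likelihood estimate of beta on (-2, 10]."""
    if grid is None:
        grid = [k / 20 for k in range(-39, 201)]  # -1.95, -1.90, ..., 10.0
    return max(grid, key=lambda beta: log_likelihood(tree, beta))
```

For a comb tree the estimate is pushed toward the lower end of the grid, and for a perfectly balanced tree toward the upper end; trees consistent with the AB model would give values near –1.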
Figure 3 displays the values of the shape statistic s as a function of tree size, and the predictions of the ERM, AB, and PDA models computed from Monte Carlo replicates. In addition, we used Monte Carlo replicates to estimate, for every tree, the P-value of its observed shape statistic s under each of the three models. If the trees were in fact sampled from one of these models, the P-values would have a uniform probability distribution for that model. The plots in Figure 4 demonstrate that uniform P-values occurred under the AB model only. Figure 3 and Figure 4 support the view that the AB model fits the TreeBASE data rather well.
During the last decade, evolutionary biologists have often observed an excess of imbalance in phylogenetic data. In their review, Mooers and Heard (1997) reported several possible explanations for this phenomenon, including errors in molecular data, incompleteness of trees, and bias due to approximate reconstruction methods. Such criticisms still apply to the phylogenies stored in TreeBASE, but the use of numerous peer-reviewed entries must have reduced the biases from such errors and reconstruction methods. Incompleteness may, however, be contributing to the trend toward extra imbalance. The phenomenon may have been amplified in studies where species were removed from the analysis deliberately and selectively (see Mooers, 1995). Our main result is that the data generally agree with a very simple probabilistic model: Aldous' Branching (AB). However, it leaves us with the issue of providing biological motivation for this model. One interpretation can start from the intuition that models with random diversification rates might be more appropriate for describing the Tree of Life than are models with deterministic rates. Some models of random diversification were investigated earlier by Heard (1996), who wrote that estimated trees from the literature correspond to very high, perhaps even implausibly high, levels of rate variation.
Here we will introduce a new model that quantifies the level of rate variation between species, and that bears strong resemblance to AB models (Equation (4)). Although one explanation will be proposed, we acknowledge that the interpretation of the AB model in terms of diversification rates will remain difficult. The following model can nevertheless suggest that AB-like data may well be explained by stochasticity acting at the level of diversification rates. The description starts from the deterministic rate biased-speciation model that shares similarities with the one introduced by Kirkpatrick and Slatkin (1993). Following speciation, the speciation rates of the sister species take a ratio x which is fixed for the entire clade. Speciation rates are assigned as follows. When a species with speciation rate λ splits, one of its descendant species is given the speciation rate λp and the other a rate λ(1–p), where p = x/(x+1) (the model can be viewed as parametrized by p rather than by x). These rates remain in effect until the two daughter species themselves speciate. These rates may seem unrealistic because they vanish as the continuous-time process evolves, but they can be easily corrected without influence on the tree topology (our only concern here). For the same reason, the initial value of λ can be fixed to one. After some calculus (see Supplementary Material), we find that the process has split distribution given by
[Equation (6): split distribution of the biased-speciation model for fixed p, not preserved in this copy.]
The novelty consists of adding a second level of randomness to the previous model assuming that, at each speciation event, the speciation rate p is a nondeterministic parameter with beta probability distribution
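[Equations (7)–(8): the beta density of p and the resulting split distribution of the beta-binomial (BB) model, not preserved in this copy.]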
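The BB construction can be explored numerically. The sketch below assumes that p follows a symmetric beta density proportional to p^α (1–p)^α with α > –1, and that the left clade size is 1 plus a binomial(n–2, p) draw, which gives a shifted beta-binomial split distribution; the symbol α, this parametrization and the function names are assumptions made for illustration and may differ from the notation of the article.

```python
import math


def log_beta_function(a, b):
    """log B(a, b) for a, b > 0."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bb_split(n, alpha):
    """Shifted beta-binomial split: i = 1 + Bin(n-2, p), p ~ Beta(alpha+1, alpha+1)."""
    if n < 2 or alpha <= -1:
        raise ValueError("need n >= 2 and alpha > -1")
    probs = []
    for i in range(1, n):
        log_p = (math.log(math.comb(n - 2, i - 1))
                 + log_beta_function(i + alpha, n - i + alpha)
                 - log_beta_function(alpha + 1, alpha + 1))
        probs.append(math.exp(log_p))
    return probs


def ab_split(n):
    """AB model (beta = -1): weights proportional to 1 / (i (n - i))."""
    weights = [1.0 / (i * (n - i)) for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

Under this parametrization `bb_split(n, 0.0)` is the uniform ERM split, and values of α close to –1 concentrate the mass on the extreme splits, as in a comb tree. A call such as `total_variation(bb_split(50, alpha), ab_split(50))` over a range of α gives one way to look for BB models close to the AB model.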
Equation (8) includes a normalizing constant. A quick look at Equation (4) suggests that although Aldous' split distribution is very similar to Equation (8), the BB model gives strictly different clade split distributions. Writing α for the parameter of the beta distribution in Equation (7), the ERM model (β = 0) is recovered by Equation (8) for α = 0, whereas α = –1 corresponds to the comb tree (β = –2). The connection between the BB model and the AB model can be sketched as follows. For β in the range (–1, ∞), the beta-splitting and BB families both rely on a binomial split distribution having its parameter p sampled from the beta density. However, the binomial distribution bin(n,p) gives positive weights to 0 and n, whereas a split distribution must be over the set {1, ..., n–1}. Aldous' models consider the conditional distribution rejecting the two extreme values 0 and n. The BB model merely adds 1 to bin(n–2,p) samples, and hence produces a slightly different result. Nevertheless, the density curves for fixed p give strong evidence that the trajectories of the BB model come very close to the beta-splitting trajectories in the space of probability models on tree structures. For β = –1, there exist values of α for which the BB model approximates the AB model very accurately. Further details can be found in the Supplementary Material.
The results presented in this study may weaken the conclusions of previous works that assume uniform branch split models for the Tree of Life and may hence limit the impact of theoretical predictions about evolutionary history that were made under such models. Structural parameters of the branching topology of the Tree of Life may indeed differ considerably. These results have been confirmed by a simultaneous study of the same data by D. Ford using other models (Ford, 2005). Our results suggest that alternative models with greater levels of tree imbalance than the ERM model may be more appropriate in further studies of large trees (e.g., AB model, nondeterministic speciation rates). Gould et al. (1977) wrote "How different then is the world from the stochastic system? ... The answer would seem to be not very." The results presented here suggest that the world is not very different from the stochastic system as long as the right stochastic system is considered.
The Supplementary Material is available from the Systematic Biology website: http://systematicbiology.org.
Acknowledgements
The authors are grateful to Mike Steel for fruitful discussions at the "Mathematics of Evolution and Phylogeny Conference" held in Paris, July 2005, and for his review and syntheses that greatly contributed to improve the original manuscript. They wish to thank David Aldous and Arne Mooers for comments on an early version of the manuscript. They warmly thank Noah Rosenberg for his suggestions and assistance. The authors thank Stephen Heard, Rod Page, and an anonymous reviewer for their constructive remarks. OF was partly supported by grants from the IMAG Institute project ALPB, and the French ministry of research ACI-IMPBIO.
References
Abramowitz M., Stegun I. A. Handbook of mathematical functions (1970) New York: Dover.
Agapow P. M., Purvis A. Power of eight tree shape statistics to detect non-random diversification: A comparison by simulation of two models of cladogenesis. Syst. Biol. (2002) 51:866–872.
Aldous D. J. The continuum random tree II: An overview. In: Stochastic analysis (Barlow M. T., Bingham N. H., eds.) (1991) Cambridge, UK: Cambridge University Press. 23–70.
Aldous D. J. Probability distributions on cladograms. In: Random discrete structures (Aldous D. J., Pemantle R., eds.) (1996) New York: Springer. 1–18.
Aldous D. J. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci. (2001) 16:23–34.
Bortolussi N., Durand E., Blum M. G. B., François O. ApTreeshape: Statistical analysis of phylogenetic treeshape. Bioinformatics (2006) 22:363–364.
Chan K. M. A., Moore B. R. Whole-tree tests of diversification rate variation. Syst. Biol. (2002) 51:855–865.
Cleveland W. S., Grosse E., Shyu W. M. Local regression models. In: Statistical models in S (Chambers J. M., Hastie T. J., eds.) (1992) Pacific Grove, CA: Wadsworth & Brooks/Cole. Chapter 8.
Colless D. H. Review of phylogenetics: The theory and practice of phylogenetic systematics. Syst. Zool. (1982) 31:100–104.
Edwards A. W. F. Likelihood (1972) Cambridge, UK: Cambridge University Press.
Felsenstein J. Inferring phylogenies (2003) Sunderland, Massachusetts: Sinauer Associates.
Fill J. A. On the distribution of binary search trees under the random permutation model. Random Struct. Algor. (1996) 8:1–25.
Ford D. Probabilities on cladograms: Introduction to the alpha model (2005) arXiv preprint math/0511246, November 2005.
Gould S. J., Raup D. M., Sepkoski J. J., Schopf T. J. M., Simberloff D. The shape of evolution: Comparison of real and random clades. Paleobiology (1977) 3:23–40.
Guyer C., Slowinski J. B. Comparisons between observed phylogenetic topologies with null expectation among three monophyletic lineages. Evolution (1991) 45:340–350.
Guyer C., Slowinski J. B. Adaptive radiation and the topology of large phylogenies. Evolution (1993) 47:253–263.
Harding E. F. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab. (1971) 3:44–77.
Harvey P. H., Leigh Brown A. J., Maynard Smith J., Nee S. New uses for new phylogenies (1996) Oxford, UK: Oxford University Press.
Heard S. B. Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution (1992) 46:1818–1826.
Heard S. B., Mooers A. Ø. Phylogenetically patterned speciation rates and extinction risks change the loss of evolutionary history during extinctions. Proc. R. Soc. Lond. B (2000) 267:613–620.
Heard S. B. Patterns in phylogenetic tree balance with variable and evolving speciation rates. Evolution (1996) 50:2141–2148.
Hey J. Using phylogenetic trees to study speciation and extinction. Evolution (1992) 46:627–640.
Kirkpatrick M., Slatkin M. Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution (1993) 47:1171–1181.
Mace G. M., Gittleman J. L., Purvis A. Preserving the Tree of Life. Science (2003) 300:1707–1709.
McKenzie A., Steel M. Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosci. (2001) 170:91–112.
Mooers A. Ø. Tree balance and tree completeness. Evolution (1995) 49:379–384.
Mooers A. Ø., Heard S. B. Inferring evolutionary process from phylogenetic tree shape. Q. Rev. Biol. (1997) 72:31–54.
Moran P. A. P. Random processes in genetics. Proc. Camb. Philos. Soc. (1958) 54:60–72.
Nee S., Barraclough T. G., Harvey P. H. Temporal changes in biodiversity: Detecting patterns and identifying causes. In: Biodiversity (Gaston K., ed.) (1996) Oxford, UK: Oxford University Press. 230–252.
Paradis E., Claude J., Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics (2004) 20:289–290.
Pinelis I. Evolutionary models of phylogenetic trees. Proc. R. Soc. Lond. B (2003) 270:1425–1431.
Purvis A., Hector A. Getting the measure of biodiversity. Nature (2000) 405:212–219.
Raup D. M., Gould S. J., Schopf T. J. M., Simberloff D. S. Stochastic models of phylogeny and the evolution of diversity. J. Geol. (1973) 81:525–542.
Sanderson M. J., Donoghue M. J., Piel W., Eriksson T. TreeBASE: A prototype database of phylogenetic analyses and an interactive tool for browsing the phylogeny of life. Am. J. Bot. (1994) 81:183–189.
Semple C., Steel M. Phylogenetics (2003) Oxford, UK: Oxford University Press.
Yule G. U. A mathematical theory of evolution, based on the conclusions of Dr J. C. Willis. Philos. Trans. Roy. Soc. London Ser. B (1924) 213:21–87.
Figure 1. Histograms of the standardized shape statistic (standardized under the ERM model) and density of the standard normal distribution. The means and variances have been estimated using 1000 Monte Carlo replicates. The dark bars in the histograms correspond to the test rejection area (P < .05). (A) Fully resolved trees from TreeBASE. (B) All trees with ERM simulation for solving polytomies.




