
Systematic Biology 2006 55(4):685-691; doi:10.1080/10635150600889625

© 2006 Society of Systematic Biologists

Which Random Processes Describe the Tree of Life? A Large-Scale Study of Phylogenetic Tree Imbalance

Edited by Mike Steel: Associate Editor

Michael G. B. Blum1 and Olivier François2

1 Department of Human Genetics, University of Michigan 2017 Palmer Commons, 100 Washtenaw Avenue, Ann Arbor, Michigan, 48109–2218, USA
2 TIMB Department of Mathematical Biology, Grenoble Universités TIMC UMR 5525, Fac. Méd, La Tronche cedex, F38706, France E-mail: Olivier.Francois{at}imag.fr (O.F.)

Received December 21, 2005; Revised February 11, 2006; Accepted March 27, 2006

The explosion of phylogenetic studies not only provides a clear snapshot of biodiversity, but also makes it possible to infer how the diversity has arisen (see, for example, Purvis and Hector, 2000; Harvey et al., 1996; Nee et al., 1996; Mace et al., 2003). To this aim, variation in speciation and extinction rates has been investigated through its signatures in the shapes of phylogenetic trees (Mooers and Heard, 1997). This issue is of great importance, as fitting stochastic models to tree data would help to understand underlying macroevolutionary processes. Although the prevailing view is that it does not represent phylogenies well, the most popular model of phylogenetic trees is a branching process introduced by Yule, in which lineages split at random (Yule, 1924). Here we report a study of one major database of published phylogenies using the Yule model as well as several other models. Our results confirm the previous observation that the Yule model is inadequate to describe phylogenetic tree data. In addition, they support the hypothesis that many trees are consistent with a simple branch split model first considered by Aldous in 1996.

The analysis of stochastic models of phylogenetic tree shape, which began with Yule in 1924, was revived in the mid-1970s by the Woods Hole Group (Raup et al., 1973; Gould et al., 1977). Their model-based approach yielded the conclusion that lineages have varied in their potential for diversification. As a result of their work, much effort has recently been placed on understanding stochastic models of tree shape and their relationship to phylogenetic data. Among these models, the Equal Rate Markov (ERM, Yule) model is one of the simplest and the one most often postulated as a null hypothesis for phylogenetic tree shape. In the ERM model each branch has an equal probability of splitting. Among others, Moran (1958) and Hey (1992) have also studied processes that share the same probability distribution of topologies as the Yule model. A second model, Proportional to Distinguishable Arrangements (PDA), has also received a lot of attention. The PDA model has the property that all tree topologies are equally likely. Although less direct than for the ERM model, several interpretations of the PDA model in terms of evolutionary processes have been given in the past. For instance, Aldous (1991) found a correspondence between the PDA model and the genealogy of n species sampled from a critical branching process (i.e., including an extinction rate). McKenzie and Steel (2001) additionally established that explosive radiation processes can lead to the PDA model. More recently, Pinelis (2003) proved that multitype branching processes with species quasi-stabilization can also yield PDA-like trees.

One aspect of tree shape is particularly important when testing stochastic macroevolutionary models: tree balance. Tree balance usually refers to the topological structure of the tree, not considering the branch lengths. Early studies (Guyer and Slowinski, 1991; Heard, 1992; Guyer and Slowinski, 1993) agreed that reconstructed phylogenies were more imbalanced than was predicted by the ERM model. However, these studies were based on small samples of trees, each of rather small size. Thus the need for new large-scale studies of phylogenetic tree imbalance has been emphasized several times (e.g., Aldous, 2001).

The ERM and the PDA models may both be viewed as branching Markov processes. Branching Markov processes are discrete recursive structures (cladograms) specified through symmetric split distributions, i.e., conditional probability distributions p(i|n) for the left sister clade size i given the parent clade size n. As shown by Harding (1971), the ERM model has the uniform split distribution throughout the tree. In other words, the probability that left sister clades contain i taxa is independent of i, and is equal to p(i|n) = 1/(n–1) for all internal nodes. For the PDA model, the split distribution is given by


$$p(i|n) = \frac{1}{2} \binom{n}{i} \frac{c_i \, c_{n-i}}{c_n}, \qquad i = 1, \dots, n-1 \tag{1}$$
where c_n = (2n – 3)!! is the number of cladograms with n tips. Aldous' Branching (AB) model is defined by the following distribution


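$$p(i|n) = \frac{1}{2 h_{n-1}} \, \frac{n}{i(n-i)}, \qquad i = 1, \dots, n-1 \tag{2}$$
where h_n is the nth harmonic number
$$h_n = \sum_{k=1}^{n} \frac{1}{k} \tag{3}$$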
The AB model corresponds to an intermediate state (β = –1) in a single-parameter family (beta-splitting) that encompasses both the ERM (β = 0) and PDA (β = –1.5) models


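$$p(i|n) = \frac{1}{a_n(\beta)} \, \frac{\Gamma(\beta + i + 1)\,\Gamma(\beta + n - i + 1)}{\Gamma(i + 1)\,\Gamma(n - i + 1)}, \qquad i = 1, \dots, n-1 \tag{4}$$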
where Γ(z) is the Gamma function (see Abramowitz and Stegun, 1970) and a_n(β) is a normalizing factor (Aldous, 1996). As an intermediate model, tree shape under the AB model differs significantly from that under the ERM and PDA models. For instance, at β = –1 the mean depth d_n of a randomly chosen taxon in an n-species tree undergoes a "phase" transition. The order of d_n is log n for β > –1 (in particular in the ERM model). As β decreases, it undergoes a sudden change at β = –1 (AB model), where it jumps to log² n, and another change for –2 < β < –1, where it jumps again to n^(–β–1). Aldous (2001) also noticed that the β = –1 model produces a better fit to some data sets than does either the ERM (Markov) or the PDA model. An attempt to describe the biological motivation for the beta-splitting model is postponed until the end of this Point of View.
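The relationship between the three split distributions can be checked numerically. The following is a minimal Python sketch, not the authors' code: it assumes the standard closed forms from Aldous (1996) for Equations (1), (2), and (4), and the function names (`beta_split`, `pda_split`, `ab_split`, `num_cladograms`) are hypothetical names chosen for illustration.

```python
import math

def beta_split(n, beta):
    """Split distribution p(i|n), i = 1..n-1, of the beta-splitting family.

    Assumed form (valid for beta > -2): p(i|n) is proportional to
    Gamma(beta+i+1) * Gamma(beta+n-i+1) / (Gamma(i+1) * Gamma(n-i+1)).
    The normalizing factor a_n(beta) is obtained by summing the weights.
    """
    logw = [
        math.lgamma(beta + i + 1) + math.lgamma(beta + n - i + 1)
        - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        for i in range(1, n)
    ]
    m = max(logw)  # subtract the maximum for numerical stability
    w = [math.exp(x - m) for x in logw]
    total = sum(w)
    return [x / total for x in w]

def num_cladograms(k):
    """c_k = (2k - 3)!!, the number of cladograms with k tips (1 for k = 1, 2)."""
    result = 1
    for j in range(3, 2 * k - 2, 2):
        result *= j
    return result

def pda_split(n):
    """PDA split distribution, assumed form (1/2) C(n,i) c_i c_{n-i} / c_n."""
    c = num_cladograms
    return [math.comb(n, i) * c(i) * c(n - i) / (2 * c(n)) for i in range(1, n)]

def ab_split(n):
    """AB split distribution, assumed form n / (2 h_{n-1} i (n - i))."""
    h = sum(1.0 / k for k in range(1, n))  # harmonic number h_{n-1}
    return [n / (2.0 * h * i * (n - i)) for i in range(1, n)]

if __name__ == "__main__":
    n = 10
    for name, beta in [("ERM", 0.0), ("AB", -1.0), ("PDA", -1.5)]:
        print(name, [round(p, 4) for p in beta_split(n, beta)])
```

Under these assumed forms, β = 0 gives the uniform distribution 1/(n–1), β = –1 reproduces the AB distribution, and β = –1.5 reproduces the PDA distribution, with more probability mass on very unequal splits as β decreases.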

Several measures of tree balance have been proposed in the literature (e.g., Colless, 1982; Agapow and Purvis, 2002; Felsenstein, 2003, chapter 33). For an n-species tree, we consider the following shape statistic


$$s = \sum_{i} \log(N_i - 1) \tag{5}$$
where the sum runs over all the internal nodes i, and N_i represents the number of extant descendants of internal node i (clade size). A similar statistic was proposed earlier by Chan and Moore (2002), but the logarithm was omitted. Once the normalizing constant has been removed, s corresponds to the logarithm of the probability of a tree in the ERM model (see Semple and Steel, 2003, pp. 29–30). In particular, employing s in statistical tests warrants maximal power for rejecting the ERM against the PDA and conversely (this results from the theory of likelihood ratios; Edwards, 1972). In addition, Fill (1996) showed that under each model s has a Gaussian distribution (for large trees) and gave asymptotic expansions for the means and variances. Computer simulations (not shown) suggest that Gaussian approximations are also accurate for trees of moderate sizes. Nevertheless, our claim is not that the statistic s is generally superior to the previous ones. This statistic has been employed for the sake of convenience, and analyses based on the Colless (1982) index have led us to the same type of results as are described hereafter.
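As an illustration of how such a statistic could be computed, here is a short Python sketch. It assumes that s is the sum of log(N_i – 1) over internal nodes, which is what the description above implies (the log-probability of a tree under the ERM model, up to sign and an additive constant). The nested-tuple tree encoding and the function names are assumptions made for this example only.

```python
import math

def clade_sizes(tree):
    """Return (number of tips, clade sizes of all internal nodes).

    A tree is encoded as nested 2-tuples; anything that is not a tuple is a tip.
    """
    if not isinstance(tree, tuple):
        return 1, []
    left_n, left_sizes = clade_sizes(tree[0])
    right_n, right_sizes = clade_sizes(tree[1])
    n = left_n + right_n
    return n, left_sizes + right_sizes + [n]

def shape_statistic(tree):
    """Assumed form of the shape statistic: sum of log(N_i - 1) over internal nodes."""
    _, sizes = clade_sizes(tree)
    return sum(math.log(n - 1) for n in sizes)

if __name__ == "__main__":
    comb = ((("a", "b"), "c"), "d")        # fully imbalanced, clade sizes 2, 3, 4
    balanced = (("a", "b"), ("c", "d"))    # fully balanced, clade sizes 2, 2, 4
    print(shape_statistic(comb), shape_statistic(balanced))
```

With this definition the comb tree receives the largest value of s for a given number of tips, so large values of s indicate imbalance.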

Here we report a study of phylogenetic imbalance based on one major database, TreeBASE (http://www.treebase.org), which serves as a searchable, archival repository of data and scientific references (Sanderson et al., 1994). In August 2005, TreeBASE contained 2063 phylogenetic trees with sizes ranging from 3 to 536, and 667 fully resolved binary trees with sizes ranging from 3 to 297. Many of these trees were rooted using one or two outgroup species. Because the outgroup taxa might contribute to an excess of imbalance (Heard, 1992), data were preprocessed by using an automatic outgroup removal procedure. This was done by identifying all trees in which one of the subtrees descended directly from the root had one or two taxa and removing that subtree. After the analysis, the automatic method was checked to produce results similar to true outgroup removal for 50 fully resolved binary trees (see Supplementary Material, http://systematicbiology.org). Two simulation methods were applied to resolve polytomies by replacing the unresolved nodes with either ERM-like or PDA-like subtrees. Both methods replaced the polytomic nodes by binary splits, simulated according to either the ERM or the PDA split distribution. Preprocessing and data analysis were performed using the "ape" and "apTreeshape" R packages (Paradis et al., 2004; Bortolussi et al., 2006).
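The two preprocessing steps can be sketched as follows. This Python example is only an illustration under stated assumptions, not the procedure implemented in the R packages: trees are nested tuples, a tree whose two root subtrees are both small is left unchanged (the text does not say how such cases were handled), and ERM-like resolution is done by repeatedly joining two uniformly chosen lineages, a construction known to give the same distribution on topologies as the Yule process. All function names are hypothetical.

```python
import random

def count_tips(tree):
    """Number of tips in a nested-tuple tree (any non-tuple is a tip)."""
    if not isinstance(tree, tuple):
        return 1
    return sum(count_tips(child) for child in tree)

def remove_outgroup(tree, max_outgroup_size=2):
    """Drop a root subtree with one or two taxa, if the other subtree is larger.

    Assumption: only binary roots are handled, and trees in which both root
    subtrees are small are returned unchanged.
    """
    if not isinstance(tree, tuple) or len(tree) != 2:
        return tree
    left, right = tree
    left_small = count_tips(left) <= max_outgroup_size
    right_small = count_tips(right) <= max_outgroup_size
    if left_small and not right_small:
        return right
    if right_small and not left_small:
        return left
    return tree

def resolve_polytomies_erm(tree, rng):
    """Replace every polytomy by a random binary subtree (ERM-like resolution)."""
    if not isinstance(tree, tuple):
        return tree
    children = [resolve_polytomies_erm(child, rng) for child in tree]
    while len(children) > 2:
        a, b = sorted(rng.sample(range(len(children)), 2))
        right = children.pop(b)   # pop the larger index first
        left = children.pop(a)
        children.append((left, right))
    return tuple(children)

def is_binary(tree):
    """True if every internal node has exactly two children."""
    if not isinstance(tree, tuple):
        return True
    return len(tree) == 2 and all(is_binary(child) for child in tree)

if __name__ == "__main__":
    rooted = ("outgroup", (("a", "b"), ("c", "d")))
    print(remove_outgroup(rooted))
    print(resolve_polytomies_erm(("a", "b", "c", "d"), random.Random(1)))
```

A PDA-like resolution would differ only in how the binary subtree is drawn (uniformly over all cladograms on the children); it is omitted here to keep the sketch short.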

The likelihood-ratio test based on s rejected the ERM model in 48% of fully resolved trees and in 48% of ERM-solved trees (one-sided test, P < .05; the P-values were computed using a direct Monte Carlo method with 1000 replicates). Figure 1 displays the distribution of standardized shape indices together with the N(0,1) distribution predicted asymptotically by the ERM model. A deeper look at the data shows that tree shape undergoes a rapid change from the smaller to the intermediate-sized and larger trees. To investigate this transition, we performed maximum likelihood parameter estimation under the beta-splitting model for three sets of trees: fully resolved, ERM solved, and PDA solved (see Supplementary Material). Figure 2 displays the estimated values β̂ for the first two data sets (third data set not shown, consistent with the others). A local regression (Cleveland et al., 1992) was performed on the estimated values. Figure 2 shows a rapid convergence around β ≈ –0.95, very close to the AB model value β = –1.
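Maximum likelihood estimation of β for a single tree can be sketched in a few lines. The Python example below is an illustration under assumptions rather than the estimation routine used in the study: it assumes the beta-splitting form of Equation (4) as given by Aldous (1996), treats the tree likelihood as the product of split probabilities over internal nodes, and uses a plain grid search over a hypothetical range of β values.

```python
import math

def log_beta_split(i, n, beta):
    """log p(i|n) under the beta-splitting model (assumed form, beta > -2)."""
    def logw(j):
        return (math.lgamma(beta + j + 1) + math.lgamma(beta + n - j + 1)
                - math.lgamma(j + 1) - math.lgamma(n - j + 1))
    terms = [logw(j) for j in range(1, n)]
    m = max(terms)
    log_norm = m + math.log(sum(math.exp(t - m) for t in terms))
    return logw(i) - log_norm

def splits(tree):
    """Return (tips, [(left clade size, parent clade size), ...]) for a binary tree."""
    if not isinstance(tree, tuple):
        return 1, []
    left_n, left_splits = splits(tree[0])
    right_n, right_splits = splits(tree[1])
    n = left_n + right_n
    return n, left_splits + right_splits + [(left_n, n)]

def log_likelihood(tree, beta):
    """Sum of log split probabilities over all internal nodes."""
    _, node_splits = splits(tree)
    return sum(log_beta_split(i, n, beta) for i, n in node_splits)

def estimate_beta(tree, lo=-1.99, hi=10.0, step=0.01):
    """Grid-search maximizer of the log-likelihood (grid bounds are arbitrary)."""
    count = int(round((hi - lo) / step)) + 1
    grid = [lo + k * step for k in range(count)]
    return max(grid, key=lambda b: log_likelihood(tree, b))

if __name__ == "__main__":
    comb = (((((((1, 2), 3), 4), 5), 6), 7), 8)
    balanced = (((1, 2), (3, 4)), ((5, 6), (7, 8)))
    print(estimate_beta(comb), estimate_beta(balanced))
```

For extreme shapes the estimate runs to the edge of the grid (toward –2 for a comb, toward large β for a perfectly balanced tree), which is one reason estimates for small trees have high variance.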


Figure 1 Histograms of the shape statistics s_erm after ERM standardization, (s_erm − E[s_erm])/σ[s_erm], and density of the standard normal distribution. The means and variances have been estimated using 1000 Monte Carlo replicates. The dark bars in the histograms correspond to the test rejection area (P < .05). (A) Fully resolved trees from TreeBASE. (B) All trees with ERM simulation for solving polytomies.

 


Figure 2 Maximum likelihood estimates of the parameter β in the beta-splitting model. The AB model corresponds to the value β = –1. A local regression curve and the line β = –.95 are also plotted. (A) Fully resolved trees from TreeBASE. (B) All trees with ERM model simulation for solving polytomies.

 
Table 1 reports the median and variance of the maximum likelihood estimator β̂ for intervals corresponding to the 20% quantiles of tree sizes. The means of β̂ were located at similar values and the variances decreased quickly as tree size increased. Note that the results for data within the smaller percentiles may not be very meaningful. The automatic outgroup removal procedure may have indeed added considerable balance to the smaller trees. The bias introduced may be particularly large in cases where the outgroups were not studied or were removed before submission of trees to the database.



 
Table 1 Median and variance estimates of the maximum likelihood estimator β for two datasets: a, Binary trees in TreeBASE. b, ERM-solved trees. The intervals are based on the 20% quantiles of the tree size distribution.

 
Figure 3 displays the values of the shape statistic s as a function of tree size, and the predictions of the ERM, AB, and PDA models computed from Monte Carlo replicates. In addition, we used Monte Carlo replicates to estimate the P-values P(S ≥ s) for every tree under the three models. If the trees were in fact sampled from one of these models, the P-values would have a uniform probability distribution for that model. The plots in Figure 4 demonstrate that uniform P-values occurred under the AB model only. Figures 3 and 4 support the view that the AB model fits the TreeBASE data rather well.
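The Monte Carlo P-values described above can be approximated without building explicit trees, because the shape statistic depends only on clade sizes. The following Python sketch assumes s is the sum of log(N_i – 1) over internal nodes and uses the assumed AB split form n / (2 h_{n-1} i (n – i)); the function names, the seed, and the default of 1000 replicates (matching the number quoted in the text) are choices made for this illustration.

```python
import math
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def erm_split(n):
    """ERM (Yule) split distribution: uniform on 1..n-1."""
    return tuple(1.0 / (n - 1) for _ in range(1, n))

@lru_cache(maxsize=None)
def ab_split(n):
    """AB split distribution (assumed form)."""
    h = sum(1.0 / k for k in range(1, n))  # harmonic number h_{n-1}
    return tuple(n / (2.0 * h * i * (n - i)) for i in range(1, n))

def simulate_s(n, split_dist, rng):
    """Draw the shape statistic of one random n-tip tree from a branching model."""
    s = 0.0
    stack = [n]
    while stack:
        m = stack.pop()
        if m < 2:
            continue  # tips contribute nothing
        s += math.log(m - 1)
        i = rng.choices(range(1, m), weights=split_dist(m))[0]
        stack.extend((i, m - i))
    return s

def mc_p_value(s_obs, n, split_dist, replicates=1000, seed=0):
    """Monte Carlo estimate of P(S >= s_obs) for n-tip trees under the model."""
    rng = random.Random(seed)
    sims = [simulate_s(n, split_dist, rng) for _ in range(replicates)]
    return sum(x >= s_obs for x in sims) / replicates

if __name__ == "__main__":
    n = 20
    s_comb = sum(math.log(k) for k in range(1, n))  # statistic of the n-tip comb
    print(mc_p_value(s_comb, n, erm_split), mc_p_value(s_comb, n, ab_split))
```

Applying `mc_p_value` to every tree in a collection and plotting the sorted P-values against uniform quantiles gives a plot of the kind shown in Figure 4.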


Figure 3 Values of the shape statistic s as a function of the number of taxa, and the values predicted by the ERM, AB, and PDA models for trees of size greater than 15. This graphic supports the fit of the AB model to TreeBASE data (fully resolved trees). The figure is plotted in a log-log scale.

 


Figure 4 U-plot: quantiles of the P-values of the shape statistic versus quantiles of the uniform distribution. The null models used to compute the P-values are the ERM, AB, and PDA models.

 
During the last decade, evolutionary biologists have often observed an excess of imbalance from phylogenetic data. In their review, Mooers and Heard (1997) reported several possible explanations for this phenomenon, including errors in molecular data, incompleteness of trees, and bias due to approximate reconstruction methods. Such criticisms still apply to the phylogenies stored in TreeBASE, but the use of numerous peer-reviewed entries must have reduced the biases from such errors and reconstruction methods. Incompleteness may, however, be contributing to the trend toward extra imbalance. The phenomenon may have been amplified in studies where species were removed from the analysis deliberately and selectively (see Mooers, 1995). Our main result says that the data generally agree with a very simple probabilistic model: Aldous' Branching. However, it leaves us with the issue of providing biological motivation for this model. One interpretation can start from the intuition that models with random diversification rates might be more appropriate for describing the Tree of Life than are models with deterministic rates. Some models of random diversification were investigated earlier by Heard (1996), who wrote that estimated trees from the literature correspond to very high, perhaps even implausibly high, levels of rate variation.

Here we will introduce a new model that quantifies the level of rate variation between species, and that bears strong resemblance to AB models (Equation (4)). Although one explanation will be proposed, we acknowledge that the interpretation of the AB model in terms of diversification rates will remain difficult. The following model can nevertheless suggest that AB-like data may well be explained by stochasticity acting at the level of diversification rates. The description starts from the deterministic rate biased-speciation model that shares similarities with the one introduced by Kirkpatrick and Slatkin (1993). Following speciation, the speciation rates of the sister species take a ratio x which is fixed for the entire clade. Speciation rates are assigned as follows. When a species with speciation rate λ splits, one of its descendant species is given the speciation rate λp and the other a rate λ(1–p), where p = x/(x+1) (the model can be viewed as parametrized by p rather than by x). These rates remain in effect until the two daughter species themselves speciate. These rates may seem unrealistic because they vanish as the continuous-time process evolves, but they can be easily corrected without influence on the tree topology (our only concern here). For the same reason, the initial value of λ can be fixed to one. After some calculation (see Supplementary Material), we find that the process has split distribution given by


$$p(i|n) = \binom{n-2}{i-1} \, p^{\,i-1} (1-p)^{\,n-i-1}, \qquad i = 1, \dots, n-1 \tag{6}$$

The novelty consists of adding a second level of randomness to the previous model assuming that, at each speciation event, the speciation rate p is a nondeterministic parameter with beta probability distribution


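$$f(p) = \frac{\Gamma(2\alpha + 2)}{\Gamma(\alpha + 1)^2} \, p^{\alpha} (1-p)^{\alpha}, \qquad 0 < p < 1, \quad \alpha > -1 \tag{7}$$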
Averaging over the p's, we find that the unconditional model is now associated with the beta-binomial (BB) split distribution


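$$p(i|n) = \frac{1}{b_n(\alpha)} \, \frac{\Gamma(\alpha + i)\,\Gamma(\alpha + n - i)}{\Gamma(i)\,\Gamma(n - i)}, \qquad i = 1, \dots, n-1 \tag{8}$$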
where b_n(α) is a normalizing constant. A quick look at Equation (4) suggests that although Aldous' split distribution is very similar to Equation (8), the BB model gives strictly different clade split distributions. For α = 0, the ERM model (β = 0) is recovered by Equation (8), whereas α = –1 corresponds to the comb tree (β = –2). The connection between the BB model and the AB model can be sketched as follows. For β in the range (–1, ∞) the beta-splitting and BB families both rely on a binomial split distribution having its parameter p sampled from the beta density. However, the binomial distribution bin(n,p) gives positive weights to 0 and n whereas a split distribution must be over the set {1, ..., n–1}. Aldous' models consider the conditional distribution rejecting the two extreme values 0 and n. The BB model merely adds 1 to bin(n–2,p) samples, and hence produces a slightly different result. Nevertheless, the density curves for fixed p give strong evidence that the trajectories of the BB model come very close to the beta-splitting trajectories in the space of probability models on tree structures. For β = –1, there exist values of α for which the BB model approximates the AB model very accurately. Further details can be found in the Supplementary Material.
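The construction "add 1 to a bin(n–2, p) sample, with p drawn from a symmetric beta density" can be checked by simulation. The Python sketch below is an illustration under assumptions: it takes the beta density to have both shape parameters equal to α + 1 (so that α = 0 gives a uniform p and α > –1 is required), and it takes the resulting closed form to be proportional to Γ(α+i)Γ(α+n–i)/(Γ(i)Γ(n–i)), which follows from the standard beta-binomial formula under that assumption. Function names are hypothetical.

```python
import math
import random

def bb_split(n, alpha):
    """Assumed closed form of the beta-binomial (BB) split distribution, alpha > -1."""
    logw = [
        math.lgamma(alpha + i) + math.lgamma(alpha + n - i)
        - math.lgamma(i) - math.lgamma(n - i)
        for i in range(1, n)
    ]
    m = max(logw)
    w = [math.exp(x - m) for x in logw]
    total = sum(w)  # plays the role of the normalizing constant b_n(alpha)
    return [x / total for x in w]

def sample_bb_split(n, alpha, rng):
    """Draw a left clade size as 1 + bin(n-2, p), with p ~ Beta(alpha+1, alpha+1)."""
    p = rng.betavariate(alpha + 1, alpha + 1)
    successes = sum(rng.random() < p for _ in range(n - 2))
    return 1 + successes

def empirical_split(n, alpha, draws=20000, seed=0):
    """Empirical frequencies of the sampled left clade sizes 1..n-1."""
    rng = random.Random(seed)
    counts = [0] * (n - 1)
    for _ in range(draws):
        counts[sample_bb_split(n, alpha, rng) - 1] += 1
    return [c / draws for c in counts]

if __name__ == "__main__":
    n, alpha = 10, -0.5
    for exact, freq in zip(bb_split(n, alpha), empirical_split(n, alpha)):
        print(round(exact, 4), round(freq, 4))
```

Comparing `bb_split(n, alpha)` with the AB distribution over a range of α values is one way to explore how closely the two families approach each other; no particular value of α is suggested here.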

The results presented in this study may weaken the conclusions of previous works that assume uniform branch split models for the Tree of Life and may hence limit the impact of theoretical predictions about evolutionary history that were made under such models. Structural parameters of the branching topology of the Tree of Life may indeed differ considerably. These results have been confirmed by a simultaneous study of the same data by D. Ford using other models (Ford, 2005). Our results suggest that alternative models with greater levels of tree imbalance than the ERM model may be more appropriate in further studies of large trees (e.g., AB model, nondeterministic speciation rates). Gould et al. (1977) wrote "How different then is the world from the stochastic system? ... The answer would seem to be not very." The results presented here suggest that the world is not very different from the stochastic system as long as the right stochastic system is considered.

The Supplementary Material is available from the Systematic Biology website: http://systematicbiology.org.


    Acknowledgements
 
The authors are grateful to Mike Steel for fruitful discussions at the "Mathematics of Evolution and Phylogeny Conference" held in Paris, July 2005, and for his review and syntheses that greatly contributed to improve the original manuscript. They wish to thank David Aldous and Arne Mooers for comments on an early version of the manuscript. They warmly thank Noah Rosenberg for his suggestions and assistance. The authors thank Stephen Heard, Rod Page, and an anonymous reviewer for their constructive remarks. OF was partly supported by grants from the IMAG Institute project ALPB, and the French ministry of research ACI-IMPBIO.


    References

    Abramowitz M., Stegun I. A. Handbook of mathematical functions (1970) New York: Dover.

    Agapow P. M., Purvis A. Power of eight tree shape statistics to detect non-random diversification: A comparison by simulation of two models of cladogenesis. Syst. Biol. (2002) 51:866–872.

    Aldous D. J. The continuum random tree II: An overview. In: Stochastic analysis, Barlow M. T., Bingham N. H., eds. (1991) Cambridge, UK: Cambridge University Press. 23–70.

    Aldous D. J. Probability distributions on cladograms. In: Random discrete structures, Aldous D. J., Pemantle R., eds. (1996) New York: Springer. 1–18.

    Aldous D. J. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to Today. Stat. Sci. (2001) 16:23–34.

    Bortolussi N., Durand E., Blum M. G. B., François O. ApTreeshape: Statistical analysis of phylogenetic treeshape. Bioinformatics (2006) 22:363–364.

    Chan K. M. A., Moore B. R. Whole-tree tests of diversification rate variation. Syst. Biol. (2002) 51:855–865.

    Cleveland W. S., Grosse E., Shyu W. M. Local regression models. In: Statistical models in S, Chambers J. M., Hastie T. J., eds. (1992) Pacific Grove, CA: Wadsworth & Brooks/Cole. Chapter 8.

    Colless D. H. Review of phylogenetics: The theory and practice of phylogenetic systematics. Syst. Zool. (1982) 31:100–104.

    Edwards A. W. F. Likelihood (1972) Cambridge, UK: Cambridge University Press.

    Felsenstein J. Inferring phylogenies (2003) Sunderland, Massachusetts: Sinauer Associates.

    Fill J. A. On the distribution for binary search trees under the random permutation model. Random Struct. Algor. (1996) 8:1–25.

    Ford D. Probabilities on cladogram: Introduction to the alpha model (2005) Arxiv preprint math-0511246, November 2005.

    Gould S. J., Raup D. M., Sepkoski J. J., Schopf T. J. M., Simberloff D. The shape of evolution: Comparison of real and random clades. Paleobiology (1977) 3:23–40.

    Guyer C., Slowinski J. B. Comparisons between observed phylogenetic topologies with null expectation among three monophyletic lineages. Evolution (1991) 45:340–350.

    Guyer C., Slowinski J. B. Adaptive radiation and the topology of large phylogenies. Evolution (1993) 47:253–263.

    Harding E. F. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab. (1971) 3:44–77.

    Harvey P. H., Leigh Brown A. J., Maynard Smith J., Nee S. New uses for new phylogenies (1996) Oxford, UK: Oxford University Press.

    Heard S. B. Patterns in tree balance among cladistic, phenetic, and randomly generated phylogenetic trees. Evolution (1992) 46:1818–1826.

    Heard S. B., Mooers A. Ø. Phylogenetically patterned speciation rates and extinction risks change the loss of evolutionary history during extinctions: Phylogenetically patterned speciation rates and extinction risks alter the calculus of biodiversity. Proc. R. Soc. Lond. B (2000) 267:613–620.

    Heard S. B. Patterns in phylogenetic tree balance with variable and evolving speciation rates. Evolution (1996) 50:2141–2148.

    Hey J. Using phylogenetic trees to study speciation and extinction. Evolution (1992) 46:627–640.

    Kirkpatrick M., Slatkin M. Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution (1993) 47:1171–1181.

    Mace G. M., Gittleman J. L., Purvis A. Preserving the Tree of Life. Science (2003) 300:1707–1709.

    McKenzie A., Steel M. Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosci. (2001) 170:91–112.

    Mooers A. Ø. Tree balance and tree completeness. Evolution (1995) 49:379–384.

    Mooers A. Ø., Heard S. B. Inferring evolutionary process from phylogenetic tree shape. Q. Rev. Biol. (1997) 72:31–54.

    Moran P. A. P. Random processes in genetics. Proc. Camb. Philos. Soc. (1958) 54:60–72.

    Nee S., Barraclough T. G., Harvey P. H. Temporal changes in biodiversity: Detecting patterns and identifying causes. In: Biodiversity, Gaston K., ed. (1996) Oxford, UK: Oxford University Press. 230–252.

    Paradis E., Claude J., Strimmer K. APE: Analyses of phylogenetics and evolution in R language. Bioinformatics (2004) 20:289–290.

    Pinelis I. Evolutionary models of phylogenetic trees. Proc. R. Soc. Lond. B. (2003) 270:1425–1431.

    Purvis A., Hector A. Getting the measure of biodiversity. Nature (2000) 405:212–219.

    Raup D. M., Gould S. J., Schopf T. J. M., Simberloff D. S. Stochastic models of phylogeny and the evolution of diversity. J. Geol. (1973) 81:525–542.

    Sanderson M. J., Donoghue M. J., Piel W., Eriksson T. TreeBASE: A prototype database of phylogenetic analyses and an interactive tool for browsing the phylogeny of life. Am. J. Bot. (1994) 81:183–189.

    Semple C., Steel M. Phylogenetics (2003) Oxford, UK: Oxford University Press.

    Yule G. U. A mathematical theory of evolution, based on the conclusions of Dr J. C. Willis. Philos. Trans. Roy. Soc. London Ser. B (1924) 213:21–87.

