Skip Navigation

Systematic Biology 2008 57(2):243-250; doi:10.1080/10635150802033014
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Steel, M.
Right arrow Articles by Rodrigo, A.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Steel, M.
Right arrow Articles by Rodrigo, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2008 Society of Systematic Biologists

Maximum Likelihood Supertrees

Mike Steel1 and Allen Rodrigo2

1 Allan Wilson Centre for Molecular Ecology and Evolution, Department of Mathematics and Statistics, University of Canterbury Christchurch, New Zealand; E-mail: m.steel{at}math.canterbury.ac.nz
2 The Bioinformatics Institute and the Allan Wilson Centre for Molecular Ecology and Evolution, University of Auckland New Zealand and Laboratoire d'Informatique, de Robotique et de Microelectronique de Montpellier France; E-mail: a.rodrigo{at}auckland.ac.nz


    Abstract
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 
We analyze a maximum likelihood approach for combining phylogenetic trees into a larger "supertree." This is based on a simple exponential model of phylogenetic error, which ensures that ML supertrees have a simple combinatorial description (as a median tree, minimizing a weighted sum of distances to the input trees). We show that this approach to ML supertree reconstruction is statistically consistent (it converges on the true species supertree as more input trees are combined), in contrast to the widely used MRP method, which we show can be statistically inconsistent under the exponential error model. We also show that this statistical consistency extends to an ML approach for constructing species supertrees from gene trees. In this setting, incomplete lineage sorting (due to coalescence rates of homologous genes being lower than speciation rates) has been shown to lead to gene trees that are frequently different from species trees, and this can confound efforts to reconstruct the species phylogeny correctly.

Keywords: Gene tree; maximum likelihood; phylogenetic supertree; species tree; statistical consistency

Received August 15, 2007; Revised November 19, 2007; Accepted January 15, 2008


Combining trees on different, overlapping sets of taxa into a parent "supertree" is now a mainstream strategy for constructing large phylogenetic trees. The literature on supertrees is growing steadily: new methods of supertree reconstruction are being developed (Cotton and Wilkinson, 2007) and supertree analyses are shedding light on fundamental evolutionary questions (Bininda-Emonds et al., 2007). Despite this surge in research activity, it is probably fair to say that biologists are still confused about what supertrees really are and what it is we do when we build a supertree. Are we, as some maintain, simply summarizing the phylogenetic information contained in a group of subtrees? Or are we trying to derive the best estimate of phylogeny given the information at hand? Nor is it clear which of these two conceptually different objectives underpin the various supertree reconstruction methods.

We take the view that what biologists really want a supertree reconstruction method to deliver is the best hypothesis of evolutionary relationships that can be inferred from the data available. Obviously, it is not the case that the supertree constructed as a summary statistic will necessarily be the best estimate of phylogeny. Nonetheless, if we are prepared to consider supertree reconstruction a problem of phylogenetic estimation, we have at our disposal an arsenal of phylogenetic tools and methods that have been tried and tested. Matrix representation with parsimony (MRP; Baum and Ragan, 1992), weighted MRP (Bininda-Emonds and Sanderson, 2001), matrix representation with compatibility (MRC; Rodrigo, 1996; Ross and Rodrigo, 2004), and, most recently, Bayesian supertree reconstruction (BSR; Ronquist et al., 2004) are undoubtedly inspired by standard phylogenetic methods. A gap remains, though, as there has been remarkably little development of likelihood-based methods for supertree reconstruction.

In this paper, we analyze an approach to obtain maximum likelihood (ML) estimates of supertrees, based on a probability model that permits errors in subtree topologies. The approach is of the type described by Cotton and Page (2004), and it permits supertrees to be estimated even if there is topological conflict amongst the constituent subtrees. We show that ML estimates of supertrees so obtained are statistically consistent under fairly general conditions. By contrast, we show that MRP may be inconsistent under these same conditions. We then consider a further complication that arises in the supertree setting when combining gene trees into species trees—in addition to the possibility that the input gene trees are reconstructed incorrectly (either a consequence of the reconstruction method used, or some sampling error), there is a further stochastic process that leads to the (true) gene trees differing from their underlying species tree (a consequence of incomplete lineage sorting). Although simple strategies such as gene concatenation have recently been shown to be potentially misleading (Degnan and Rosenberg, 2006), we show that an ML supertree approach for combining gene trees is also statistically consistent.


    Terminology
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 
Throughout this paper, unless stated otherwise, phylogenetic trees may be either rooted or unrooted, and we will mostly follow the notation of Semple and Steel (2003). In particular, given a (rooted or unrooted) phylogenetic tree Formula on a finite set X of taxa (which will always label the leaves of the tree), any subset Y of X induces a phylogenetic tree on taxon set Y, denoted Formula |Y, which, informally, is the subtree of Formula that connects the taxa in Y only. In the supertree problem, we have a sequence Formula = (Formula 1, Formula 2, ..., Formula k) of input trees, called a profile, where Formula i is a phylogenetic tree on taxon set Xi. We wish to combine these trees into a phylogenetic tree Formula on the union of the taxon sets (i.e., X = X1 {cup} X2 {cup} ... {cup} Xk). We assume that the trees in Formula are either all rooted or all unrooted, and that Formula is rooted or unrooted accordingly. We will mostly assume that trees are fully resolved (i.e., binary trees, without polytomies); in a remark following Theorem 4, we briefly describe how this restriction can be lifted. Furthermore, in this paper we consider just the tree topology, not the branch lengths.

A special case of the supertree problem arises when the taxon sets of the input trees are all the same (X1 = X2 = ... = Xk). This is the much studied consensus tree problem. In an early paper, McMorris (1990) described how, in this consensus setting, the majority-rule consensus tree can be given a maximum likelihood interpretation. However, this approach is quite different from the one described here (it is restricted to the consensus setting and is based on a very particular probability model).

In this paper, we will denote the underlying ("true") species tree as Formula 0 (assuming that such a tree exists and that the evolution of the taxa has not involved reticulate processes such as the formation of hybrid taxa). In an ideal world, we would like Formula i = Formula 0 | Xi for each tree Formula i in the profile—that is, we would like each of the reconstructed trees to be identical to the subtree of the true tree for the taxa in X0. But in practice, the trees Formula 1,...,Formula k are unlikely to even be compatible (i.e., no phylogenetic tree Formula exists for which Formula i = Formula |Xi for all i).


    An Exponential Model of Phylogenetic Error
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 
Species trees that have been inferred from data may differ from the true underlying species tree for numerous reasons, including sampling effects (short and/or site-saturated sequences, or poorly defined characters), model violation, sequencing or alignment errors, and so forth. In this section, we will assume a simple model of phylogenetic error in which the probability of observing a given tree falls off exponentially with its distance from an underlying generating tree (e.g., the true species tree Formula 0). This type of model has been described by Holmes (2003) in the general setting of Bayesian statistical analysis in phylogenetics. Suppose d is some metric on resolved phylogenetic trees. There are several possible choices for d, depending on the biological context and computational considerations, and we describe some of these shortly. In the exponential model, the probability, denoted PFormula , Y[Formula '] (or more briefly PFormula [Formula ']) of reconstructing any species tree Formula ' on any taxon set Y (where Y sub X), when Formula is the generating tree (on taxon set X) is proportional to an exponentially decaying function of the distance from Formula ' to Formula |Y. In other words,


Formula 1

(1)
The constant β can vary with Y and other factors (such as the quality of the data); for example, trees constructed from long high-fidelity sequences are likely to have a larger β than trees constructed from shorter and/or noisier sequences. The constant {alpha} is simply a normalizing constant to ensure that {sum}Formula 'PFormula , Y[Formula '] = 1, where the sum is over all fully resolved phylogenetic trees Formula ' on taxon set Y. When we have a sequence (X1, X2, ...) of subsets of X, we will reflect the dependence of {alpha}, β on Xi by writing {alpha}i and βi. Note that {alpha}i is determined entirely by βi and |Xi|.

Note that, implicit in (1), the probability of Formula ' depends only on the subtree of Formula connecting the species in Formula ' and not on the other species in Formula that are not present in Formula '.

Now, suppose we observe the profile of trees Formula = (Formula 1, Formula 2, ..., Formula k) as above, where Formula i has leaf set Xi. Assume that, for each i, the tree Formula i has been independently sampled from the exponential distribution (1) with β = βi. Select a phylogenetic X tree Formula that maximizes the probability (PFormula (Formula 1, ..., Formula k)) of generating the observed profile Formula —we call this type of a tree Formula an ML supertree for Formula .

An ML supertree has a simple combinatorial description as a (weighted) median tree, as the following result shows. Note that the use of sums of tree-to-tree distances as optimality criteria has previously been suggested by Wilkinson et al. (2001).

Proposition 1.For any metric d on phylogenetic trees, an ML supertree for a profile Formula under the exponential model (1), is precisely a tree Formula that minimizes the weighted sum:


Formula

Proof. By the independence assumption,


Formula

and, by (1), PFormula , Xi [Formula i] = {alpha}iexp [–βid(Formula i, Formula |Xi)]. Consequently, PFormula [(Formula 1, Formula 2, ..., Formula k)] is proportional to


Formula

and this is maximised for any tree Formula that minimizes {sum}i = 1k βid(Formula i, Formula |Xi). This completes the proof.

Note that there may exist more than one ML supertree. This would be more likely if one has a small number of input trees and the distance metric is "coarse" (e.g., the symmetric difference metric) rather than "fine" (e.g., SPR: subtree-prune-and-regraft distance) and if the βi values are all equal.

In the special case where d is the nearest-neighbor interchange (NNI) metric, and the βi values are all equal, this ML supertree was described by Cotton and Wilkinson (2007). Moreover, in the consensus tree setting, and where d is the symmetric difference (Robinson-Foulds) metric, the consensus of the ML supertrees is the same as the usual majority-rule consensus tree. This follows from earlier results by Barthélemy and McMorris (1986; see also Cotton and Wilkinson, 2007).

The choice of metric d should be guided by the biological context and computational expediency—-for example, the SPR metric appropriately models horizontal gene transfer events, whereas the NNI metric may be appropriate for trees that lack local resolution; however, the symmetric difference metric may also be useful as it is fast to compute.

Note also that the βi values are not regarded as variables to be optimized in the ML procedure (to do so would lead to all trees being optimal solutions because βi = 0 would be the optimal selection for all i). Rather, they allow other external factors (e.g., how well supported each input tree is by the data) to be reflected in the model. As a default option (in the absence of such knowledge), one might set all the βi values equal.

The use of the exponential error model requires some explanation. In standard, character-based, or sequence-based phylogenetic analysis, the probability of obtaining the data is derived using a model of character or sequence substitution. This "process-based" approach to calculating the likelihood differs from the "error-based" approach we have adopted here. With an error-based approach, we model the probability distribution of outcomes, in this case the distribution of trees one may obtain from subsets of data. The value of this approach is that it is independent of process. The choice of an exponential distribution to describe the fall-off in probability of a tree as one moves further from the true tree is somewhat arbitrary but is due to simplicity and computational efficiency (Proposition 1). Regardless of the reasons why subtrees differ from the true supertree (i.e., regardless of the process), if the distribution of these differences can be modeled using an exponential distribution, our results hold.


    Statistical Consistency of ML Supertrees under the Exponential Model
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 
Is the ML procedure statistically consistent as the number (k) of trees in the profile grows? More precisely, under what conditions is the method guaranteed to converge on the underlying generating tree Formula 0 as we add more trees to the analysis? The problem is slightly different from other settings (such as the consistency of ML for tree reconstruction from aligned sequence data) where one has a sequence of i.i.d. random variables. In the supertree setting, it is perhaps unrealistic to expect that the data sets are generated according to an identical process, since the sequence of subsets X1, X2,..., of X is generally deliberately selected.

To formalize the statistical consistency question in this setting, let X1, X2,..., be a sequence of subsets of finite set X. It is clear that the Xis must cover X in some reasonable way in order for the ML supertree method to be consistent—for example, if some taxon is not present in any Xi, or is present in only a small number of input trees, then we cannot expect the location of this taxon in any supertree to be strongly supported.

Thus, we will assume that the sequence of subsets of X satisfies the following covering property: For each subset Y of taxa from X of size m (where m = 3 for rooted trees or m = 4 for unrooted trees), the proportion of subsets Xi that contain Y does not decay to 0 as the sequence length (of subsets) increases. More formally, for each such subset Y of X we assume there is some {varepsilon} > 0 and some K sufficiently large for which:


Formula 2

(2)
If a subset of taxa, Y, is only found in one or a few trees and is never seen again in trees that are subsequently added, this property will not hold.

We now establish statistical consistency of ML supertrees under a much more general class of models than the exponential model. Consider any model M for generating phylogenetic trees (without branch lengths) on subsets of X that has a generating phylogenetic X tree Formula as its sole underlying parameter. Such a model will typically derive from a more complex model containing other parameters (such as branch lengths, population sizes, and so forth), but we will assume that these have a prior distribution and that they have been integrated out, so our model has just one parameter—the tree topology.

Given a sequence (X1, X2, ..., Xk...) of subsets of X we say that M satisfies the property of centrality if, for some {eta} > 0, and all i ≥ 1, we have:


Formula 3

(3)
for all trees Formula ' on leaf set Xi that are different from Formula |Xi. This condition says that, if Formula is the generating tree, then amongst all phylogenetic trees on leaf set Xi, the one that Formula induces on Xi is always (strictly) more probable than any other tree.

As a related condition, we say that M satisfies the property of basal centrality if, for some {eta} > 0, and all i ≥ 1, we have:


Formula 4

(4)
for all trees Formula ' on leaf set Y that are different from Formula |Y, and all Y subeqXi, of size m (= 3 for rooted trees and = 4 for unrooted trees). This condition says that, if Formula is the generating tree, then amongst all the small (m element) subsets Y of Xi the tree that Formula induces on Y is (strictly) more probable that any other tree.

Notice that the exponential model (with the βi values bounded away from 0) satisfies centrality (take {eta} = min i ≥ 1 [{alpha}i(1–e–βi)]). We will see later that basal centrality is a useful concept for another setting (lineage sorting). Note also that basal centrality (for a sequence Xi) is not a special case of centrality, and, conversely, centrality is not a special case of basal centrality, though of course the two notions coincide in the special case where |Xi| = m for all i.

We now show that the selection of an ML supertree is statistically consistent for any model M that satisfies either centrality or basal centrality when applied to a sequence of subsets of X that obeys the covering property. Its proof (along with the proofs of all following propositions and theorems) is given in the Appendix.

Theorem 2.Given a sequence X1,X2,..., which satisfies the covering property (2), consider a profile Formula k = (Formula 1, ..., Formula k), where Formula i is generated independently, from a tree Formula 0 with taxon set X, according to a model M that satisfies either the centrality or basal centrality property for this sequence. Then the probability that Formula k has a unique ML supertree and that this tree is Formula 0 tends to 1 as k -> {infty}.

Remarks

  • We can easily modify the ML process if some of the input trees are not fully resolved (due to "soft" polytomies). For a general phylogenetic tree ti (possibly with polytomies) on taxon set Xi subeqX, and a generating fully resolved phylogenetic tree Formula on taxon set X, let {varphi}(ti|Formula ) be the probability of the event that the tree Formula i that Formula generates under the exponential model is a refinement of ti. More precisely,


    Formula

    where Formula i ≥ ti indicates that the (fully resolved) tree Formula i contains all the splits present in ti and has the same leaf set (Xi). Notice that {varphi}(ti|Formula ) is not a probability distribution on phylogenetic trees with the leaf set Xi (its sum is > 1). Nevertheless, given a profile Formula = (t1, ..., tk) of phylogenetic trees (some or all of which may have polytomies), one can perform ML to select the tree Formula that maximizes the joint probability prodi = 1k {varphi}(ti|Formula ) of the events Formula i ≥ ti for i = 1,..., k.

  • We point out an alternative way of viewing this ML procedure applied to a profile Formula = (Formula 1, ..., Formula k) when d is one of two well-known metrics on trees (SPR and TBR). Suppose that we were to extend each tree Formula i in Formula to a tree Formula 'i on the full set of taxa (X). We could regard the placement of those taxa that are missing in Formula i (namely the taxa in XXi) to form a tree Formula i' on the full leaf set X to be "nuisance parameters" in a maximum likelihood framework (under the exponential model), and thereby seek to find the tree Formula and extensions (Formula 1', ..., Formula k') to maximize the joint probability:


    Formula

    We call such a tree Formula an extended ML tree for the profile Formula ; it turns out to be just the same as the ML trees we have defined above, as the next result shows.

Proposition 3.For d = SPR or d = TBR, and any profile Formula of fully resolved, unrooted phylogenetic trees, the extended ML tree(s) for Formula coincides precisely with the ML tree(s) for Formula .


    Relation to MRP and Its Statistical Inconsistency
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 
In the MRP (matrix representation with parsimony) supertree method, the input trees are coded as characters—one for each interior edge of each tree—with character states 0, 1, ? depending, respectively, on whether the taxon in question is on one side, or the other, of the edge, or is absent from the tree. A most parsimonious tree (or trees) is then reconstructed from this data matrix. As shown recently by Bruen and Bryant (2007), there is a close analogy between MRP and consensus tree methods, which seek a median tree computed using the SPR (subtree prune and regraft) or TBR (tree bisection and reconnection) metric d (recall that a median tree for a profile Formula = (Formula 1, ... Formula k) of trees that all have same leaf set X, is a tree Formula that minimizes the sum {sum}i = 1kd(Formula , Formula i); cf. Proposition 1). However, the result from Bruen and Bryant (2007) does not guarantee that MRP produces an ML supertree even when βi = 1 for all i, as their approach constructs a median on a space of trees that are defined by splits rather than particular trees.

We turn now to the question of the statistical consistency of MRP under the exponential model (1). It can be shown that MRP will be statistically consistent under the covering property (2) in some special cases. Two such cases that can be formally established (details omitted) are (i) when all the subsets Xi are of size m (= 3 for rooted trees and = 4 for unrooted trees); or (ii) when βi is sufficiently large (in relation to |X|). However, in general, we have the following result.

Theorem 4.A β > 0 exists for which MRP is statistically inconsistent even in the special (consensus) case where, for all i, Xi is the same set of six taxa and βi = β. More precisely, for this value of β and with unrooted fully-resolved phylogenetic trees on these (equal) taxon sets, the probability that Formula 0 is an MRP tree (for a profile of trees generated under (1)) converges to 0 as k tends to infinity.

Remarks

  • The formal proof of Theorem 4 is given in the Appendix. However, the intuition behind the proof is that if β is sufficiently small (but still positive), then trees generated by the exponential model are close to a uniform random distribution on trees. For such input there is a slight bias of MRP towards unbalanced trees. This shape bias property of MRP has been explored (in related settings) by Wilkinson et al. (2005).
  • The exact condition (on the βis) at which inconsistency of MRP occurs is not clear, though some general comments can be made. For large values of βi, the decay of the exponential function is rapid, and the probability of producing an incorrect subtree some distance away from the true tree is small. In these cases, MRP is likely to work well. However, when βi is small, the probability of producing an incorrect subtree is relatively high, even when this subtree is some distance away from the true tree. An interesting theoretical question is whether a value s isin (0,1) exists for which MRP is statistically consistent (for arbitrarily large taxon sets) under the conditions of Theorem 4, whenever βi = β ≥ s.


    Statistical Consistency of ML Species Supertrees from Multiple Gene Trees
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 
A current problem in phylogenetics is how best to infer species trees from gene trees (Degnan and Rosenberg, 2006; Gadagkar et al., 2005; Liu and Pearl, 2007). Even in the consensus setting (i.e., when the set of taxa for each gene tree is the complete set of taxa under study), Degnan and Rosenberg (2006) have demonstrated how incomplete lineage sorting on gene trees can mean that the most likely topology for a gene tree can differ from the underlying species tree (for certain rooted phylogenetic trees on four taxa and for all rooted phylogenetic trees on five or more taxa). This surprising result implies that simplistic majority-rule approaches to finding a consensus species tree can be problematic.

The phenomenon described by Degnan and Rosenberg (2006) is based on the coalescent model for studying lineage sorting in evolving populations. The surprising behavior arises only when the effective population sizes and the branch lengths of the species tree are in appropriate ranges. Moreover, for rooted three-taxon trees, the most probable gene tree topology always agrees with the species tree topology. Nevertheless, the fact that larger gene trees can favor an incorrect species tree might easily complicate some standard statistical approaches.

In this section, we show how, despite the phenomena described above (from Degnan and Rosenberg, 2006), and even in the more general supertree setting (where some gene trees may have some missing taxa), a maximum likelihood approach to supertree construction of a species tree from gene trees is statistically consistent.

Consider first a somewhat idealist situation where we have a sequence of rooted gene trees, each correctly inferred (but possibly differing from the species tree due to lineage sorting) for a sequence of (arbitrary size) subsets of X that satisfy the covering property. The statistical consistency of ML under this scenario follows immediately from Theorem 2 (using basal centrality, not centrality) because lineage sorting (regarded as an error model M for deriving gene trees from species trees under the coalescent model) satisfies basal centrality. This is because, under the coalescent model, PcFormula , Y[Formula |Y] = 1–2/3e{lambda}, whereas PcFormula ,Y[Formula ''] = 1/3e{lambda} for the two other choices of Formula '' != Formula |Y, where {lambda} is (up to a factor of 2) the ratio of time (measured in generations) to the effective population size (see, e.g., Rosenberg, 2002; Tajima, 1983). Consequently,


Formula

for all subsets Y of Xi of size 3. The limitation of this consistency result is that it invokes the unrealistic assumption that the gene trees have all been correctly inferred. We thus consider a more realistic scenario; however, in order to prove a consistency result, we must restrict the input trees to have just 3 leaves each. We thus consider the following situation:

Triplet-based supertree with dual error: a sequence of rooted gene trees inferred, possibly with error, on subset of X of size 3 that satisfy the covering property, provided the error in the reconstructed gene tree (differing from the true gene tree) is described by the exponential model.

The statistical consistency of ML supertrees under this scenario (Triplet-based supertree with dual error) is due to the following result (where M1 is the lineage sorting model that derives a gene tree from a species tree, whereas M2 models the error in reconstructing the gene tree correctly).

Proposition 5.Given a sequence X1,X2,..., of subsets of X of size 3 that satisfies the covering property (2), consider a profile Formula k = (Formula 1, ..., Formula k), where tree Formula i is generated independently, from a tree Formula 0 with taxon set X, by first generating a tree on taxon set Xi under some central model, M1, and then, independently, using this tree to generate Formula i according to an exponential model, M2. Then the probability that Formula k has a unique ML supertree and that this tree is Formula 0 tends to 1 as k -> {infty}.


    Discussion
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 
To develop a likelihood-based supertree reconstruction method, it is necessary to define a model that delivers the probability of obtaining a series of subtree topologies, given a hypothesized supertree. We have chosen a very simple yet intuitive probability function whereby the probability of observing a wrong subtree (i.e., one where the topology differs from that of a pruned supertree) decreases exponentially as its topology becomes increasingly distant from that of the hypothesised supertree. Consequently, the ML supertree can be estimated even when the constituent subtrees have conflicting topological signals.

Our approach is model based, but one may reasonably ask whether the model described here is a biologically realistic one. We suggest that it is. For one thing, we expect, for a variety of reasons, to see conflicts between the topologies of subtrees and the reconstructed supertree. With gene sequences obtained from different species, for instance, incomplete lineage sorting and ancestral heterozygosity frequently lead to differences between gene trees and species trees. Convergent and parallel evolution can confound phylogenetic reconstruction, as can long-branch attraction. We have chosen to use the exponential distribution to describe this steady decrease in probabilities as the distances between subtrees and supertrees increase. The value of using the exponential distribution lies in the ease with which it can be manipulated when we compute log-likelihoods. Additionally, we have noted that our model is an error-based model; i.e., it describes the distribution of subtree-to-supertree distances without regard for the underlying processes that cause topological conflicts. However, we suggest that one fruitful research project may be to explore other possible probability distributions, other tree-to-tree distance metrics, as well as process-based models of topological conflict. In this respect, coalescent-based models of incomplete lineage sorting (Carstens and Knowles, 2007) may hold some promise. As with the original phylogenetic likelihood methods designed for character and sequence data, we hope to see new models emerge as other researchers explore the use of ML supertrees. Even if one is reluctant to use the exponential distribution, the statistical consistency of ML supertrees is nonetheless guaranteed, provided the probability distribution of subtree-to-supertree distances assigns the highest probability to a distance of zero.

The likelihood framework provides an additional benefit: a rich body of statistical and phylogenetic methods already use likelihood. Moreover, statistical consistency holds for maximum likelihood supertrees under weak conditions, in contrast to MRP, which can be inconsistent in some cases. We also show that the ML supertree approach developed here provides a statistically consistent strategy for combining gene trees even when there is the possibility that these trees may be different from the true species tree. An obvious application of ML supertrees will be their use in statistical tests of topological hypotheses, and we already know how to do this with standard ML phylogenies (Goldman et al., 2000).

We also recognize that our particular likelihood implementation is closely related to the Majority-Rule(–) Supertree construction proposed by Cotton and Wilkinson (2007). More precisely, when the tree metric is the symmetric difference (Robinson-Foulds) metric, then the Marjority-Rule(–) Supertree is, in effect, the strict consensus of our ML supertrees. However, the approach in Cotton and Wilkinson (2007) is quite different: they show how to extend majority rule from the consensus to the supertree setting. Nonetheless, they converge on the same optimality criterion that we use; i.e., a supertree that minimizes the sum of distances to a set of trees. One should not be surprised that the same optimality criterion can emerge from different conceptual bases. With standard phylogenetic reconstruction, choosing the tree that minimizes the number of evolutionary changes can be justified philosophically (with the principle of maximum parsimony) as a consensus method (Bruen and Bryant, 2007) or by using an explicitly statistical approach such as likelihood (Steel and Penny, 2000).

We have not discussed algorithms to search for ML supertrees. As with classical phylogenetic likelihood methods, maximum likelihood supertrees will be obtained using a variety of approaches, including heuristic methods, genetic algorithms, simulated annealing, and so on. We also direct readers to the discussion in Cotton and Wilkinson (2007), because the criterion we use is similar to theirs.


    Appendix. Proofs of Main Results
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 
In this section we collect together the proofs of the main results. We start by stating a general sufficient condition for the consistency of ML in non-i.i.d. settings.

Consistency of ML for General (non-i.i.d.) Sequences
Here we describe a convenient way to establish the statistical consistency of maximum likelihood when we have a sequence of observations that may not be independent or identically distributed. We frame this discussion generally, as the result may be useful for other problems. In particular, in this result, we do not need to assume the sequence samples are independent (though in our applications, they are), nor identically distributed (in our applications, they are not). Suppose we have a sequence of random variables Y1, Y2, ..., that takes values in some finite set W and are generated by some process that depends on an underlying discrete parameter a that can take values in some finite set A. In our setting, the Yis are trees constructed from different data sets (e.g., gene trees), whereas a is the generating species tree topology. We assume that the model specifies the probability distribution of (Y1,..., Yk) given (just) a—for example, in our tree setting this would mean specifying prior distributions on the branch lengths and other parameters of interest (e.g., ancestral population sizes) and integrating with respect to these priors.

Given an actual sequence (y1,..., yk) of observations, the maximum likelihood (ML) estimate of the discrete parameter is the value a that maximizes the joint probability


Formula

(i.e., the probability that the process with parameter a generates (y1, ..., yk)). Now suppose that the sequence Y1,... Yk, ..., is generated by a0. We would like the probability that the ML estimate is equal to a0 to converge to 1 as k increases. If this holds for all choice of a0 isin A, then ML is statistically consistent. The following result provides a convenient way to establish this; indeed, it characterises the statistical consistency of ML.

Proposition 6.In the general setup described above, ML is statistically consistent if and only if the following condition holds: for any two distinct elements a,bisin A, we can construct a sequence of events E1, E2,..., where Ek is dependent on (Y1, ... Yk), for which, as k -> {infty}:

  1. the probability of Ek under the model with parameter a converges to 1.
  2. the probability of Ek under the model with parameter b converges to 0.

Proof. The "only if" direction is easy: Suppose ML is statistically consistent and a, b isin A are distinct. Let Ek be the event that a is the unique maximum likelihood estimate obtained from (Y1, ..., Yk). Then Ek satisfies conditions (i) and (ii).

For the converse direction, recall that the variation distance between two probability distributions p,q on any finite set W is


Formula

where Pp(E) = {sum}w isin Ep(w) is the probability of event E under distribution p (similarly for Pq(E)). This variation distance can also be written as 1/2||pq||1, where ||pq||1 = {sum}w isin W|p(w) – q(w)| is the l1 distance between p and q. Thus, if we let d(k)(a,b) denote the l1 distance between the probability distribution on (Y1, ..., Yk) induced by a and by b, then conditions (i) and (ii) imply that


Formula 5

(5)
Now, by the first part (equation 3.1) of Theorem 3.2 of Steel and Szekely (2002), the probability that the ML estimate is the value of A that generates the sequence (Y1, ..., Yk) is at least 1 – {sum}b != a [1– 1/2d(k)(a,b)] and so, by (5), this probability converges to 1 as k -> {infty}.

Proof of Theorem 2. To establish the theorem, using Proposition 6 (stated above) it is enough to specify for each choice of distinct resolved phylogenetic X-trees Formula 0 and Formula , a sequence of events Ek (dependent on Formula k) for which, as k grows, Ek has a probability that tends to 1 under the distribution obtained from Formula 0 and tends to 0 under the distribution obtained from Formula . Since Formula differs from Formula 0 a subset Y exists of size m (= 3 for rooted trees and = 4 for unrooted) for which Formula |Y != Formula 0|Y. Notice that the covering property (2) implies that


Formula 6

(6)

Let Ek be the event that among all those i isin {1, ..., k} for which Formula |Xi != Formula 0|Xi, we have Formula i = Formula 0|Xi more often than Formula i = Formula |Xi. Now, for a profile generated by Formula 0 according to the model satisfying the centrality property, we have, for each i for which Formula |Xi != Formula 0|Xi,


Formula 7

(7)
Similarly, for a profile generated by Formula according to a model satisfying the centrality property, and for each i for which Formula |Xi != Formula 0|Xi, we have


Formula 8

(8)

By condition (6), there is a positive limiting proportion ({varepsilon} > 0) of i for which Formula |Xi != Formula 0|Xi. By independence (and the law of large numbers) it follows from inequality (7) that event Ek has a probability that tends to 1 as k -> {infty} for a profile generated by Formula 0. Similarly, by (8), event Ek has a probability tending to 1 as k -> {infty} for a profile generated by Formula . Statistical consistency of ML now follows by Proposition 6.

For a model M satisfying basal centrality the same argument applies if we select Y as above and modify event Ek to be the event that among all those i isin {1, ..., k} for which Y subeqXi we have Formula i|Y = Formula 0|Y more often than Formula i = Formula |Y.

Proof of Theorem 4. For two unrooted fully-resolved phylogenetic X-trees Formula , Formula ', let L(Formula ,Formula ') denote the total parsimony score on Formula ' of the set of splits of Formula . That is,


Formula 9

(9)
where {Sigma}(Formula ) is the set of splits of Formula and l({sigma}, Formula ') is the parsimony score of the split {sigma} on Formula ' (treating {sigma} as a binary character; Semple and Steel, 2003). For any fully resolved phylogenetic X tree Formula ', let e(Formula '; Formula 0) be the expected total parsimony score on Formula ' of the set of splits of a tree Formula randomly generated by Formula 0 according to the exponential model (1). Then,


Formula 10

(10)
To establish Theorem 4, it is enough to show, for some β > 0 and for two unrooted fully resolved trees Formula 0, Formula 1 on X = {1, ..., 6}, that e(Formula 0; Formula 0) e(Formula 1; Formula 0) > 0, because if Formula 0 is the generating tree, then Formula 1 will be favored over Formula 0 by MRP. We first show that this can occur when β = 0. In that case, {alpha}exp [–β d(Formula , Formula 0)] = 1/105 for all Formula (there are 105 unrooted fully-resolved phylogenetic trees on X) and so, by (10), we have


Formula

Applying (9) and interchanging the order of summation gives:


Formula 11

(11)
where n({sigma}) is the number of unrooted fully resolved phylogenetic X trees containing split {sigma} and the summation is over all the splits of X = {1,..., 6}. Moreover, if any difference l({sigma}, Formula 0)– l({sigma}, Formula 1) is non-zero in (11), then {sigma} is necessarily a split that partitions the taxa into sets of sizes either 2,4 or 3,3, and for such a split {sigma} we have n({sigma}) = 15 and 9, respectively.

Now suppose Formula 0 has a symmetric shape (i.e., an unrooted fully resolved tree of six leaves with three cherries) and Formula 1 has a pectinate shape (i.e., an unrooted fully resolved tree of six leaves with two cherries). Then, by using earlier results (Steel et al., 1992, table 3) concerning the number of splits that partition the taxa into sets of size 2,4 and 3,3 and have parsimony score 1,2,3 in these two trees, it can be shown from (11) that


Formula

So far, we have assumed that β = 0; however, e(Formula 0; Formula 0) e(Formula 1; Formula 0) is a continuous function of β, so a strictly positive value of β exists for which


Formula

This completes the proof.

Proof of Proposition 3. For d = SPR or d = TBR, and for any resolved unrooted phylogenetic trees Formula on taxon set X, and Formula Y on taxon set Y subeqX we claim that:


Formula 12

(12)
where Formula ' ranges over the set of all unrooted resolved phylogenetic tree on taxon set X that induce Formula Y when restricted to Y. To establish this claim, note firstly that, for any Formula ' with taxon set X, and Y subeqX, we have d(Formula '|Y, Formula |Y) ≤ d(Formula ', Formula ) and so the ≥ inequality holds in (12). To establish equality induction on k (starting with the base case k = 1) shows that if Formula Y and Formula |Y are k SPR (or TBR) moves apart, then there exists an unrooted resolved phylogenetic Formula ' on taxon set X that induces Formula Y when restricted to Y, and which is at most k SPR (or TBR, respectively) moves apart from Formula . Having established (12), Proposition 3 now follows by Proposition 1.

Note that Equation (12) does not necessarily hold for other tree metrics such as the NNI (nearest-neighbor interchange) or the partition (Robinson-Foulds) metric.

Proof of Proposition 5. By Proposition 2 it suffices to show that the composite model satisfies centrality. Given a rooted phylogenetic tree Formula on taxon set X, and a subset Y of X of size 3, label the three rooted binary trees on Y, Formula 1, Formula 2, Formula 3, so that Formula 1 = Formula |Y. We need to show that, for some {eta}' > 0, PFormula ,Y(Formula 1) ≥ PFormula ,Y(Formula j) + {eta}' for j = 2,3. By the independence assumption,


Formula

where PFormula ,Y(1)(Formula j) is the probability of generating Formula j under model M1 from generating tree Formula , and PFormula j, Y(2)(Formula k) is the probability of generating Formula k under model M2 with generating tree Formula j. Now, PFormula j, Y(2)(Formula k) takes the value {alpha} for j = k, and the value {alpha} e–β for j != k. Thus, PFormula ,Y(Formula 1) = {alpha} PFormula ,Y(1)(Formula 1) + {alpha} e–β[PFormula ,Y(1)(Formula 2) + PFormula ,Y(1)(Formula 3)], whereas PFormula ,Y(Formula 2) = {alpha} PFormula ,Y(1)(Formula 2) + {alpha} e–β[PFormula ,Y(1)(Formula 1) + PFormula ,Y(1)(Formula 3)]. Consequently,


Formula

A similar expression holds for other difference PFormula ,Y(Formula 1) PFormula ,Y(Formula 3) and because M1 is central, we can take {eta}' to be the minimal value of {alpha} (1–e–β)[PFormula ,Y(Formula 1) PFormula ,Y(Formula k)] over k = 2,3.


    Acknowledgements
 
We thank the Allan Wilson Centre for Molecular Ecology and Evolution for supporting this work. Allen Rodrigo began this project while he was working with Olivier Gascuel at the Laboratoire d'Informatique, de Robotique et de Microelectronique de Montpellier.


    References
 Top
 Abstract
 Terminology
 An Exponential Model of...
 Statistical Consistency of ML...
 Relation to MRP and...
 Statistical Consistency of ML...
 Discussion
 Appendix. Proofs of Main...
 References
 

    Barthélemy J. P., McMorris F. R. The median procedure for n-trees. J. Classif. (1986) 3:329–334.[CrossRef]

    Baum B. R., Ragan M. A. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon (1992) 41:3–10.[CrossRef][Web of Science]

    Bininda-Emonds O. R. P., Cardillo M., Jones K. E., MacPhee R. D. E., Beck R. M. D., Grenyer R., Price S. A., Vos R. A., Gittleman J. L., Purvis A. The delayed rise of present-day mammals. Nature (2007) 446:507–512.[CrossRef][Medline]

    Bininda-Emonds O., Sanderson M. J. Assessment of the accuracy of matrix representation with parsimony analysis supertree construction. Syst. Biol. (2001) 50:565–579.[Abstract/Free Full Text]

    Bruen T., Bryant D. Parsimony as consensus. Syst. Biol. (2007) In press.

    Carstens B., Knowles L. Estimating phylogeny from gene tree probabilities in melanoplus grasshoppers despite incomplete lineage sorting. Syst. Biol. (2007) 56:400–411.[Abstract/Free Full Text]

    Cotton J. A., Wilkinson M. Majority-rule supertrees. Syst. Biol. (2007) 56:445–452.[Abstract/Free Full Text]

    Degnan J. H., Rosenberg N. A. Discordance of species trees with their most likely gene trees. PLoS Genet. (2006) 2(5):e68.[CrossRef][Medline]

    Gadagkar S. R., Rosenberg M. S., Kumar S. Inferring species phylogenies from multiple genes: Concatenated sequence tree versus consensus gene tree. J. Exp. Zool. (2005) 304B:64–74.

    Goldman N., Anderson J. P., Rodrigo A. G. Likelihood-based tests of topologies in phylogenetics. Syst. Biol. (2000) 49:652–670.[Abstract/Free Full Text]

    Holmes S. Statistics for phylogenetic trees. Theor. Pop. Biol. (2003) 63:17–32.[CrossRef][Web of Science][Medline]

    Liu L., Pearl D. K. Species trees from gene trees: Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst. Biol. (2007) 56:504–514.[Abstract/Free Full Text]

    McMorris F. R. The median procedure for n-trees as a maximum likelihood method. J. Classif. (1990) 7:77–80.[CrossRef]

    Rodrigo A. G. On combining cladograms. Taxon (1996) 45:267–274.[CrossRef][Web of Science]

    Ronquist F., Huelsenbeck J. J., Britton T. Bayesian supertrees. In: Phylogenetic supertrees—Bininda-Emonds O. R. P., ed. (2004) Dordrecht, The Netherlands: Kluwer Academic Publishers. Chapter 9.

    Rosenberg N. A. The probability of topological concordance of gene trees and species trees. Theor. Pop. Biol. (2002) 61:225–247.[CrossRef][Web of Science][Medline]

    Ross H. A., Rodrigo A. G. An assessment of matrix representation with compatibility in supertree reconstruction. In: Phylogenetic supertrees—Bininda-Emonds O. R. P., ed. (2004) Dordrecht, The Netherlands: Kluwer Academic Publishers. Chapter 2.

    Semple C., Steel M. Phylogenetics (2003) Oxford University Press.

    Steel M., Penny D. Parsimony, likelihood and the role of models in molecular phylogenetics. Mol. Biol. Evol. (2000) 17:839–850.[Abstract/Free Full Text]

    Steel M., Szekely L. A. Inverting random functions (ii): Explicit bounds for discrete maximum likelihood estimation, with applications. SIAM J. Discr. Math. (2002) 15:562–575.[CrossRef]

    Steel M. A., Hendy M. D., Penny D. Significance of the length of the shortest tree. J. Classif. (1992) 9:71–90.[CrossRef]

    Tajima F. Evolutionary relationships of DNA sequences in finite populations. Genetics (1983) 105:437–460.[Abstract/Free Full Text]

    Wilkinson M., Cotton J. A., Creevey C., Eulenstein O., Harris S. R., Lapointe F.-J., Levasseur C., Mcinerney J. O., Pisani D., Thorley J. L. The shape of supertrees to come: Tree shape related properties of fourteen supertree methods. Syst. Biol (2005) 54:419–431.[Abstract/Free Full Text]

    Wilkinson M., Thorley J. L., Littlewood D. T. L., Bray R. A. Towards a phylogenetic supertree for Platyhelminthes? In: Interrelationships of the Platyhelminthes—Littlewood D. T. L., Bray R. A., eds. (2001) London, UK: Taylor and Francis. 292–301.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
S. Simon, S. Strauss, A. von Haeseler, and H. Hadrys
A Phylogenomic Approach to Resolve the Basal Pterygote Divergence
Mol. Biol. Evol., December 1, 2009; 26(12): 2719 - 2730.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
L. Liu, L. Yu, D. K. Pearl, and S. V. Edwards
Estimating Species Phylogenies Using Coalescence Times among Sequences
Syst Biol, October 1, 2009; 58(5): 468 - 477.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
J. Dong and D. Fernandez-Baca
Properties of Majority-Rule Supertrees
Syst Biol, July 3, 2009; (2009) syp032v1.
[Full Text] [PDF]


Home page
Am. J. Bot.Home page
R. Torices and A. A. Anderberg
Phylogenetic analysis of sexual systems in Inuleae (Asteraceae)
Am. J. Botany, May 1, 2009; 96(5): 1011 - 1019.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
M. T. Holder, J. Sukumaran, and P. O. Lewis
A Justification for Reporting the Majority-Rule Consensus Tree in Bayesian Phylogenetics
Syst Biol, October 1, 2008; 57(5): 814 - 821.
[Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Steel, M.
Right arrow Articles by Rodrigo, A.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Steel, M.
Right arrow Articles by Rodrigo, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?