Skip Navigation

Systematic Biology 2007 56(5):798-817; doi:10.1080/10635150701639754
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ranwez, V.
Right arrow Articles by Douzery, E. J. P.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Ranwez, V.
Right arrow Articles by Douzery, E. J. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 Society of Systematic Biologists

PhySIC: A Veto Supertree Method with Desirable Properties

Edited by Olaf Bininda-Emonds

Vincent Ranwez1, Vincent Berry2, Alexis Criscuolo1,2, Pierre-Henri Fabre1, Sylvain Guillemot2, Celine Scornavacca1,2 and Emmanuel J. P. Douzery1

1 Institut des Sciences de l'Evolution (ISEM, UMR 5554 CNRS), Université Montpellier II Place E. Bataillon, CC 064, 34095, Montpellier, Cedex 5, France E-mail: ranwez{at}isem.univ-montp2.fr
2 Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM,UMR 5506, CNRS), Université Montpellier II 161 rue Ada, 34392, Montpellier, Cedex 5, France


    Abstract
 Top
 Abstract
 Non-Contradiction and Induction...
 PhySIC: A Polynomial-Time Veto...
 Biological Case Studies on...
 Conclusion
 Appendix 1
 References
 
This paper focuses on veto supertree methods; i.e., methods that aim at producing a conservative synthesis of the relationships agreed upon by all source trees. We propose desirable properties that a supertree should satisfy in this framework, namely the non-contradiction property (PC) and the induction property (PI). The former requires that the supertree does not contain relationships that contradict one or a combination of the source topologies, whereas the latter requires that all topological information contained in the supertree is present in a source tree or collectively induced by several source trees. We provide simple examples to illustrate their relevance and that allow a comparison with previously advocated properties. We show that these properties can be checked in polynomial time for any given rooted supertree. Moreover, we introduce the PhySIC method (PHYlogenetic Signal with Induction and non-Contradiction). For k input trees spanning a set of n taxa, this method produces a supertree that satisfies the above-mentioned properties in O(kn3 + n4) computing time. The polytomies of the produced supertree are also tagged by labels indicating areas of conflict as well as those with insufficient overlap. As a whole, PhySIC enables the user to quickly summarize consensual information of a set of trees and localize groups of taxa for which the data require consolidation. Lastly, we illustrate the behaviour of PhySIC on primate data sets of various sizes, and propose a supertree covering 95% of all primate extant genera. The PhySIC algorithm is available at http://atgc.lirmm.fr/cgi-bin/PhySIC.

Keywords: Formal properties; phylogenetics; polynomial-time algorithms; primates; software; supertrees; triplets; veto methods

Received August 21, 2006; Revised November 23, 2006; Accepted June 8, 2007


Building Supertrees
Phylogenies are invaluable tools in various areas of biology to understand the evolution of genes and taxa. Trees that incorporate an exhaustive sampling of taxonomic biodiversity provide crucial information about systematics, genomics, and diversification patterns of species (e.g., Davies et al., 2004). Large trees can be built using various approaches, including supermatrices and supertrees. The former approach consists of combining the different source data sets into a supermatrix of characters and then analyzing it under standard phylogenetic reconstruction criteria (e.g., Delsuc et al., 2005). The supertree approach is an alternative methodology using trees rather than character data as a primary source of information. It first involves inferring partially overlapping, source phylogenetic trees from initial character data and then assembling them into a larger, more comprehensive supertree (Bininda-Emonds, 2004a). This approach is particularly convenient when dealing with heterogeneous character sources; e.g., those scored from morphological, transposable elements, DNA, or protein studies. Supertrees have become increasingly popular (e.g., Bininda-Emonds, 2004b), notably since the seminal work involving the reconstruction of the primate supertree (Purvis, 1995a). The widespread use of supertrees is explained by three useful applications (Wilkinson et al., 2004): (i) they provide large phylogenetic frameworks for broad comparative studies; (ii) they evaluate the congruence of sets of input trees, and reveal conflicts due to outlier/unstable taxa; and (iii) they identify insufficient overlap among leaf sets of input trees and assign priorities for choosing the taxa to be subsequently sampled.

Different Kinds of Supertree Methods
Supertree methods fall into three categories depending on their way of handling topological conflicts; i.e., different arrangements of the same leaves among labeled source trees.

The first suite of methods does not handle incompatible source trees. The pioneering methods that belong to this category are Build (Aho et al., 1981) and the strict consensus supertree (Gordon, 1986). Although they are important milestones, these methods appear "of limited use. As most systematists know, phylogenies usually conflict with one another" (Bininda-Emonds, 2004b:4).

The second suite of methods handles conflicts among input trees in a liberal way: they apply a voting procedure. In order to extract their main phylogenetic signal, source trees are asked to vote on various parts of the phylogeny to be inferred, with the most supported candidates being elected and composing the output supertree. Voting methods are said to resolve conflicts (Thorley and Wilkinson, 2003): for each conflict, they use some optimization criterion to make a decision in favor of one of the topological alternatives. Most conflicts among input trees are expected to be resolved because relationships displayed by the supertree are guided by source topologies on the basis of weight of evidence. The most widespread voting method is matrix representation with parsimony (MRP) whereby nodes of each source tree are encoded as binary characters of a matrix, which is then analyzed with the maximum parsimony criterion to obtain the composite tree (Baum 1992; Ragan, 1992). Analyzing this binary encoding of source topological information with other tree-building criteria leads to variants of MRP such as matrix representation with flipping (MRF; Chen et al., 2003) and matrix representation with compatibility (MRC; Ross and Rodrigo, 2004). Other methods of the voting kind, such as MinCut (MC; Semple and Steel, 2000) and ModifiedMinCut (MMC; Page, 2002) extend Build. They encode source trees in a graph that is progressively decomposed to get supertree clades. When conflicts hinder the decomposition, the graph is cut by removing the least supported relationships. The Average Consensus Supertree (Lapointe and Cucumel, 1997) and super distance matrix (Criscuolo et al., 2006) methods implement the voting approach in an alternative way. They average the initial distance matrices, converted from source characters or valued topologies, into a superdistance matrix; a tree-building distance-based approach is then used to infer a supertree from this matrix. Interestingly, voting methods like MRP may generate novel clades; i.e., clades not present in any input tree alone (Purvis, 1995b; Bininda-Emonds and Bryant, 1998; Sanderson et al., 1998). Unfortunately, when source trees conflict, novel clades that are contradicted by each of the source trees can be present in the supertree inferred by MRP (Goloboff and Pol, 2002; Goloboff, 2005; Cotton et al., 2006) and by MRF (Goloboff, 2005). The importance of this phenomenon is still debated, Bininda-Emonds (2003) reporting, on the basis of simulations, that this situation is not very frequent for MRP, whereas Goloboff (2005) shows selected case studies where "this situation is, clearly, not very unlikely."

The third suite of methods handles conflicts among input trees in a conservative way. They adopt a veto philosophy: the phylogenetic information of every source topology is to be respected, and the supertree is not allowed to contain clades that a source tree would vote against. These methods remove conflicts (Thorley and Wilkinson, 2003) because they either propose multifurcations in the supertree (Goloboff and Pol, 2002) or prune rogue taxa (Berry and Nicolas, 2004). In this framework, the supertree should not retain a single branching pattern within a given clade when several valid topological alternatives are present in the source trees. The full agreement required by veto methods provides an unambiguous phylogenetic framework that is, for instance, well suited for taxonomic revisions. More specifically, such a conservative approach may be applied to automatically build or update parts of the Tree of Life (http://tolweb.org). Several supertree methods akin to the veto philosophy have been proposed, all of which are inspired by consensus approaches that operate on trees with identical leaf sets. For example, extensions of the strict consensus (Gordon, 1986; Huson et al., 1999), semi-strict consensus (Goloboff and Pol, 2002), and maximum agreement subtree consensus (Berry and Nicolas, 2004) have been proposed to infer veto supertrees.

Properties of Supertree Methods
To assess the relevance of supertree methods, it is most useful to have properties characterizing the extent to which the supertrees they infer are reliable syntheses of source trees (Bininda-Emonds and Bryant, 1998; Steel et al., 2000;Wilkinson et al., 2004; Goloboff, 2005). For instance, Steel et al. (2000) suggest that the output supertree should (i) encompass every source tree when possible, (ii) always contain every leaf (taxon) that occurs in at least one source tree, and (iii) be computed under a running time that grows polynomially with respect to the total number of leaves. These authors also showed that rooted input trees are more appealing than unrooted ones for supertree methods that aim to satisfy several desirable properties simultaneously. Yet, even if supertree methods satisfy some desirable properties, the inferred supertrees often contain polytomies that actually intermix two distinct phenomena: either a lack of overlap in the topological information among source trees, or the occurrence of topological conflicts among them, or a combination of these. We thus decided to develop a method that proposes supertrees with unambiguous resolutions, and provides biologists with explanations about causes of polytomies. For this purpose, we rely on two new formal properties.

On the one hand, we think that supertree methods should avoid arbitrary resolutions; i.e., resolutions that are not entailed by the source topologies. Indeed, novel relationships displayed by a supertree "are worrying if they are not implied by combinations of the input trees" (Wilkinson et al., 2005), and "should be identified as such, to highlight their lack of any known justification" (Pisani and Wilkinson, 2002). Thus, we first request that every piece of phylogenetic information displayed in the supertree be present in one or several source topologies or be induced by their interaction; we call this the induction property.

On the other hand, we focus on unanimous clades, thus adopting a veto point of view. This means that the supertree is not allowed to contain a clade that conflicts either directly with a source tree or indirectly with a combination of them. We call this the non-contradiction property. Such a supertree, which incorporates only uncontradicted input relationships, provides a reliable baseline for subsequent analyses (Goloboff and Pol, 2002; Goloboff, 2005).

Goloboff and Pol (2002) mentioned similar properties in a formal characterization involving triplets. They provide examples showing that supertree methods of the voting kind, such as MRP and MC, understandably do not respect these properties. Although being appealing, the characterization proposed by Goloboff and Pol (2002) can at times be too restrictive or permissive (see following sections). Recently, Grunewald et al. (2006) provided another characterization of a property related to arbitrary resolutions contained in the supertree with respect to source trees. In both cases, there does not seem to be any straightforward algorithm that would always allow for property verification. Note, however, that Goloboff and Pol (2002) proposed a supertree heuristic algorithm that satisfies the desired properties in most cases.

In this paper, we provide a characterization of non-contradiction and induction properties that differs from those of Goloboff and Pol (2002) and Grunewald et al. (2006). We also describe simple and polynomial-time algorithms that enable users to check whether or not a given supertree satisfies these properties. Then we propose an algorithm called PhySIC that improves the Build algorithm (Aho et al., 1981) by always inferring a supertree and which, moreover, satisfies the non-contradiction and induction properties. As far as we know, this is the first time that a polynomial-time method is proposed that always satisfies properties related to induction and non-contradiction. Moreover, improving the behavior of Build with respect to arbitrary decisions can benefit the various methods that extend this algorithm for supertree purposes; e.g., MC (Semple and Steel, 2000), MCC (Page, 2000), AncestralBuild (Daniel and Semple, 2004; Berry and Semple, 2006), and RankedTree (Bryant et al., 2004). Next, we pinpoint the difference between the behavior of PhySIC and that of well-known supertree methods on a biological case study on Primates. Lastly, we illustrate PhySIC on the reconstruction of the primate supertree at the genus level from various source trees based on mitochondrial DNA, nuclear DNA, and jumping gene sequences. The supertree reconstructed appears to be useful for displaying phylogenetic relationships among the major primate taxa (Goodman et al., 2005). Moreover, the produced supertree displays label(s) on each of its polytomous nodes that identify the cause(s) of these polytomies (lack of cross-information and/or presence of contradictions). The PhySIC method has been implemented in C++ using the Bio++ library (Dutheil et al., 2006) and is freely available as a web service and for download at http://atgc.lirmm.fr/cgi-bin/PhySIC/physic.cgi.


    Non-Contradiction and Induction Properties
 Top
 Abstract
 Non-Contradiction and Induction...
 PhySIC: A Polynomial-Time Veto...
 Biological Case Studies on...
 Conclusion
 Appendix 1
 References
 
We first introduce vocabulary and notations required to formally define the properties of non-contradiction (PC) and of induction (PI). Simple examples are then used to illustrate the relevance of PC and PI as well as to relate them with previously proposed properties for supertree methods (Steel et al., 2000; Goloboff and Pol, 2002). Then we show how to check in polynomial time whether a supertree satisfies PC and PI for a given collection of source trees.

Topological Description of Trees
The definitions and notations used for trees and their topological description are mainly the same as those used by Semple and Steel (2003). We only consider rooted phylogenies, due to the fact that supertree methods cannot fulfill different desirable properties listed in Steel et al.(2000) when considering unrooted trees. Hereafter, the terms phylogeny and tree are considered synonymous. Given a tree T, L(T) denotes the set of taxa associated to its leaves. More generally, given a collection {tau} of trees, L({tau}) denotes the set of taxa appearing in at least one tree of {tau}. Given two phylogenies T and T' on the same leaf set (L(T) = L(T')), we say that T refines T' whenever T contains all clades of T'. In other words, either T and T' are identical or T can be transformed into T' by collapsing some of its internal edges.

A rooted tree on three leaves A,B,C has only three possible binary shapes, called triplets and denoted by AB|C, respectively AC|B, respectively BC|A, depending on the innermost clade (AB, respectively AC, respectively BC). Given a triplet t, Formula denotes any of the two other triplets on the same set of leaves. Alternatively, a tree on three leaves can be a star tree; i.e., a unique internal node connected to the leaves. Any rooted tree T can be equivalently described by a set of triplets: homeomorphic to subtrees of T connecting three leaves. (e.g., Grunewald et al., 2006); rt(T) denotes this set. Given a collection {tau} of phylogenies, rt({tau}) = {cup}Tiisin{tau}rt(Ti) denotes the set of triplets present in these phylogenies. Note that it is possible that rt({tau}) contains two triplets t and Formula , namely when {tau} hosts two incompatible phylogenies. Clearly, two such triplets cannot be combined into a single supertree of the collection.

Given a set R of triplets, L(R) denotes the set of taxa appearing in at least one tree in R. A tree T is said to display a set R of triplets when R{subseteq} rt(T); moreover, Tstrictly displays R if additionally L(T) = L(R). A set R of triplets is compatible if there is a tree T that displays R. To find a tree displaying R, it is useful to take into account that some triplets of the tree are induced by R: a compatible set R of triplets induces a triplet t, denoted by R{vdash} t, if and only if R{cup}{ Formula } is not compatible, or equivalently if any tree T that displays R contains t. For instance, any tree displaying {AB|C,BC|D} also has to display the triplet AC|D; i.e., {AB|C,BC|D}{vdash} AC|D. Bandelt and Dress (1986) and Dekker (1986) were among the first to investigate such induction rules. The set of all triplets induced by a compatible set R is called the closure of R and is denoted by cl(R). Source trees considered for supertree building are sometimes incompatible, and then the set of triplets considered is incompatible. Nonetheless, we can characterize the set of triplets induced by these collections by extending the preceding definition: we will say that a set R of triplets induces a triplet t when there is a compatible subset R' of R that induces t.

Characterizing Non-Contradiction and Induction by Triplets
Here we describe two important properties that veto method supertrees should satisfy. They concern topological relationships that a supertree should not contain with respect to the input trees: first, it should not contain relationships contradicting the source trees (PC property); moreover, it should only contain relationships that are induced by the input trees (PI property). Below we detail these two properties.

There are several ways for a supertree to contradict a collection of source trees. The most direct contradiction occurs when only one resolution appears for a group of taxa in one or several source trees, and the supertree contains a different resolution for the group. When different resolutions appear in source trees, as soon as the supertree proposes a resolution for the concerned taxa it contradicts at least one input tree. Contradictions are less direct when the supertree proposes a resolution that contradicts no single input tree but does contradict a combination of them.

A relationship contained in a supertree can present no contradiction with the input trees and still not be desirable. For instance, Figure 1 shows a collection of two source trees and four possible supertrees, T', T", T"', and T, for this collection. Both B and C are sister taxa of A in the source trees, but no information is present in these trees to resolve the clade A, B, C. Thus, the fully resolved supertrees T', T", T"' all take arbitrary decisions by proposing one of the possible resolutions for this clade. Here, T is the sole supertree not proposing an arbitrary resolution for the clade. Arbitrary resolutions are misleading as they display relationships that are not entailed by the input trees.


Figure 1
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 Supertrees can contain arbitrary resolution. Example of a collection {tau} = {T1,T2} of two source trees and several possible supertrees T', T", T"', and T. Unlike T, the supertrees T',T",T"' propose an arbitrary resolution for the clade, A, B, C.

 
The above properties can be formalized in different ways depending on the kind of topological relationship considered; e.g., clades, nestings, triplets, etc. Following Goloboff and Pol (2002) and Grunewald et al. (2006), we chose to focus on triplets. Given a collection {tau} of input trees and a candidate supertree T, RT,{tau} denotes the set of triplets of {tau} for which T proposes a resolution. More formally, RT,{tau} = {AB|C isin rt({tau}) suchthat {AB|C,AC|B,BC|A}{cap} rt(T)!= {emptyset}}. The set RT,{tau} corresponds to all topological information present in collection {tau} that is related to the information present in supertree T. Using this notation, we can express the induction property (PI) and the non-contradiction property (PC) as follows:
  • T satisfies PI for {tau} if and only if for all tisinrt(T), it holds that RT,{tau}){vdash} t. In other words, PI requires that each and every triplet of T is induced by RT,{tau}.
  • T satisfies PC for {tau} if and only if for all tisinrt(T) and all Formula , it holds that RT,{tau}Formula Formula . This means that, for each and every triplet of T, RT,{tau} induces no alternative resolution.

For instance, considering collection {tau} = {T1,T2} in Figure 2 and supertree T' of Figure 3, the set RT',{tau} is {AC|E, AC|F, AB|E, AB|F, BC|E, BC|F, EF|A, EF|B, EF|C}. Note that the triplet AD|C present in rt({tau}) due to T2 is not in this set because A,D,C are multifurcating in T'. When the source trees are incompatible, it is possible that R(T,{tau}) contains two different triplets for the same three taxa. For example, consider the supertree T in Figure 3 proposed by the MC and MMC voting methods (Semple and Steel, 2000; Page, 2002) on the collection {tau} = {T1,T2}. R(T,{tau}) contains both AB|C (resulting from T2) and AC|B (resulting from T1). In this case, the supertree T that contains the triplet t = AB|C does not satisfy PC, since Formula = AC|B is in RT,{tau} (hence, RT,{tau}{vdash} Formula ). Indeed, in this example, supertree T includes topological information contained in T2 that contradicts that of T1. This situation indirectly results from a difference in the sizes of the clades of T1 and T2, which are incompatible: the clade containing more taxa (here (A,B,D) in T2 versus (A,C) in T1) is favored in the MC-MMC supertree. Such a size bias effect has been well-known in the field since purvis (1995b) demonstrated it for the MRP voting method. Here it is illustrated for another voting method, and one might wonder whether this size bias is present in most voting methods. Note, however, that this size bias does not seem to have a major impact on MRP's accuracy (Baum and Ragan, 2004).


Figure 2
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 Example of a collection {tau} = {T1,T2} of two source trees.

 


Figure 3
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 Various supertrees for the collection of Figure 2. T is the supertree proposed by the MC (Semple and Steel, 2000) and MMC methods (Page, 2002). T' and T" are the supertrees respectively proposed by the BuildPC and PhySICPC algorithms described in this paper.

 
When the source trees are compatible, any reasonable method is expected to produce a supertree satisfying PC. However, some methods usually propose a supertree that does not satisfy PI. Indeed, compatible source trees can sometimes be displayed by an exponential number of supertrees, and some methods arbitrarily propose only one of them, thus selecting some triplets to the exclusion of other possible triplets. For instance, when considering the trivial case of two source trees AB|E and CD|E, both MC and MMC propose the supertree ((A,B),(C,D),E), whereas numerous supertrees are possible; e.g., ((A,C),(B,D),E). In such a case, it seems preferable to output a consensus of all possible supertrees, as done by MRP (e.g., Bininda-Emonds and Bryant, 1998). Unfortunately, some topological information of the source trees (e.g., triplets) can be absent from the obtained consensus as it can contain highly multifurcating nodes.

However, for some compatible collections of trees, it is possible to find a supertree that displays all triplets of the collection and is also refined by all other possible supertrees. More formally, a set R of triplets is said to identify a tree T if and only if T strictly displays R and T is refined by every tree T' that strictly displays R. A set R can identify at most one tree; thus, when the triplet set R = rt({tau}) of a collection {tau} of source trees identifies a tree, this tree is a canonical representation of all possible supertrees.

Considering practical collections {tau} of source trees, rt({tau}) will almost never identify a tree, either because this set is incompatible or because it does not identify a particular tree. Nevertheless, it is possible that a subset of the triplets in rt({tau}) identifies a tree T, and then the topological information contained in T exactly corresponds to a subset of the topological information contained in {tau}. Such a subset is most interesting when the triplets t it contains do not have an alternative resolution Formula in rt({tau}). This situation occurs for the subset RT,{tau} of rt({tau}) when the supertree T satisfies PI and PC.

Proposition 1 A tree T satisfies PI and PC for a collection {tau} of trees if and only if R(T,{tau}) identifies T.

The proof is given in Appendix 1. It is based on the fact that a set R identifies a tree T if and only if rt(T) = cl(R) (Grunewald et al., 2006: lemma. 2.1). This proposition confirms the relevance of PI and PC: having a supertree T that satisfies both of them highlights a part of rt({tau}) (namely R(T,{tau}) that exactly corresponds to a tree, i.e., does not contain arbitrary topological information, and moreover does not contradict any input tree. Such a feature is most desirable for supertrees inferred by veto methods.

Links with Other Advocated Properties
Properties similar to PI and PC were described in Goloboff and Pol (2002: 519) as "the property of [the supertree] displaying AB|C if it is found in some input tree or implied by some combination of input trees and no input tree or combination of input trees displays or implies AC|B or BC|A." These properties were also pointed out as being desirable by Grunewald et al. (2006). Using our formalism, they can be translated as follows for a supertree T representing a collection {tau}:

  • PI': for any tisin rt(T), it holds that rt({tau}) {vdash} t
  • PC': for any tisin rt(T) and for all Formula , it holds that rt({tau}) Formula Formula .

The essential difference between PI'-PC' and PI-PC is whether we evaluate supertrees based on triplets in the original set of trees, rt({tau}), or on the triplets commonly resolved by the supertree and at least one of the source trees, R(T,{tau}). From the statement of the properties, it is clear that PC' implies PC and PI implies PI'. It is thus natural to wonder which version of the properties is preferable. Below, we show an example where PC' is too restrictive, and an example where PI' is too permissive. In contrast, PI and PC behave correctly in these examples.

Example 1 Let {tau} = {T1,T2} with T1 and T2 as shown in Figure 4. rt({tau}) contains AE|B and AC|E; therefore, rt({tau}){vdash} AC|B. We also have rt({tau}) {vdash} AB|C because AB|C isin rt(T1). Thus any tree providing a triplet on {A,B,C} does not satisfy PC'. For analogous reasons PC' does not allow us to propose any triplet in the supertree. Thus, PC' rejects the tree T of Figure 4. Yet T is a reasonable and informative supertree for {tau} and satisfies both PI and PC.


Figure 4
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 Excluding rogue taxa from the analysis can lead to informative supertrees.

 
We note that T is not a plenary supertree, i.e., it does not contain all input taxa, but this example shows that removing rogue taxa is a way in which more informative supertrees can be obtained. This is in line with the remark of (wilkinson et al. 2004), who stated that "non-plenary supertree methods might be most useful for identifying unstable leaves." For instance, such leaves might be involved in lateral transfers. This example easily generalizes to cases where the supertree actually contains more leaves than each source tree. Figure 5 depicts this generalization.


Figure 5
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5 Another example with rogue taxa. This figure presents a generalization of the example displayed in Fig. 4 to the case of a supertree containing more taxa than each input tree.

 
The next example shows a supertree satisfying both PI' and PC' while also displaying irrelevant triplets.

Example 2 Let {tau} = {T1,T2} with T1 and T2 as illustrated in Figure 6. rt({tau}) = {AB|C,AB|X,BC|A}. The tree T of Figure 6 displays {AB|X,BC|X,AC|X}. AB|X is present in (thus induced by) rt({tau}) the two other triplets can also be induced from rt({tau}): {AB|X,BC|A}{vdash}{BC|X,AC|X}. It follows that T satisfies PI'. Moreover, it is easily seen that no combination of triplets in rt({tau}), other than {AB|X,BC|A}, induces triplets. However, T is clearly not an ideal supertree for {tau} as no information in {tau} induces group A,B,C to nest inside group A,B,C,X. The property PI, not satisfied by T, detects this problem: here R(T,{tau}) only contains the triplet AB|X and thus it does not induce the triplet AC| present in T.


Figure 6
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 6 Contradiction in the source trees can lead to arbitrary resolution. An example where the presence of contradiction in the source trees (namely, AB|C in T1 versus BC|A in T2) can lead to the inferrence of arbitrary clades (namely, excluding X from the clade {A,B,C} in the supertree T). This problem is detected by PI but not by PI' nor PC'.

 
The PI' property quoted by Goloboff and Pol (2002) is stronger than the Pareto property (Neumann, 1983; Wilkinson et al., 2004) on triplets, which requires that the output tree contain all triplets and splits that occur in all source trees. The Pareto property is appealing in general and has also been advocated in the supertree context (property P6 of Steel et al., 2000). However, imposing the Pareto property on triplets may be problematic, even in the case of compatible source trees (Thorley and Wilkinson, 2003). This is due to the possibility of having several candidate supertrees that are both compatible with source trees and respect the Pareto property. In this case, no single supertree exists that satisfies the Pareto property while having no arbitrary resolution. The strict consensus of these supertrees does not necessarily satisfy the Pareto property. A solution is then to return several trees, either all candidate supertrees or their reduced consensus (wilkinson, 1994). However, this solution may not well be suited when the aim is to summarize a collection of source trees into a single supertree that is more easily dealt with for further analysis by biologists.

When source trees are incompatible, it may even be impossible to have a supertree satisfying both the Pareto and non-contradiction properties (PC and PC') as shown in the following example.

Example 3 Consider the collection {tau} = {T1,T2} where T1 = (((A,D),B),((C,F),E)) and T2 = (((A,E),(B,F)),(C,D)). Triplets AB|C and EF|D are displayed by both trees of {tau}. Thus, any supertree T for {tau} must include all leaves in {tau} in order to satisfy the Pareto property. Since rt({tau}) contains AB|D and AD|B, any tree T displaying a triplet for the three leaves does not satisfy PC (hence PC'). For similar reasons, no supertree T can display a triplet on the taxa A, C, and D. Thus, any supertree satisfying PC (or PC') and including all taxa of {tau} contains a multifurcating node on taxa A,B,C,D, and hence does not display the triplet AB|C; i.e., does not satisfy the Pareto property.

In other words, imposing the Pareto property can lead the supertree to explicitly contradict relationships present in some input trees. This shows that the Pareto property on triplets is not compatible with the veto approach, where the proposed supertree must not contradict the source trees. However, the Pareto property can be considered for other topological relationships (wilkinson et al., 2004). For example, there is always a supertree satisfying PI and PC as well as the Pareto property on partial or full splits contained in the source trees.

The Pareto property specifies relations that the supertree must contain. The complementary co-Pareto property specifies relations that the supertree must not contain. The co-Pareto property in the consensus context requires that the consensus tree contain no relationships that are not present in at least one input tree. However, (wilkinson et al., 2004) point out that this statement is not reasonable for supertrees, because "they might contain relashionships that are entailed by the input trees in combination, but are not present in any of them singly." Then they propose a weaker version that requires that the supertree does not contain relationships that are contradicted by all the input trees whose leaf set makes a contradiction possible. Note that, any supertree satisfying PC also satisfies the latter version of co-Pareto.

Steel et al. (2000) list five other properties that might be requested from supertree methods: changing the order of the trees in the input collection does not change the supertree (P1); renaming the taxa of the source trees gives the same supertree, but with the taxa renamed accordingly (P2); the output tree displays the source trees when they are compatible (P3); each leaf (taxon) that occurs in at least one source tree is in the supertree (P4); the running time of the method grows polynomially with respect to the total number of taxa (P5). The following example shows that ensuring P3 can force the supertree to contain arbirtrary clades. Thus, P3 can conflict with PI.

Example 4 Let {tau} = {T1,T2} with T1 = ((A,B),W) and T2 = ((A,B),(X,(Y,Z))). A supertree with taxon set {A,B,W,X,Y,Z} that satisfies P3 must display T2, hence must have a clade including Y,Z but not X. However, it will contain arbitrary clades, no matter where taxon W is attached. This is because any supertree satisfying PI must include a polytomy on W, X, Y, Z since source trees include no information on the relative position of W and the group X, Y, Z. For instance, the supertree ((A,B),(X,Y),W,Z) excludes the possibility for (A,B) and (X,Y) to be intermixed.

Note that if polytomies of a supertree are interpreted in terms of an Adams consensus (Adams, 1972), then this example does not put P3 into question. However, this interpretation of polytomies does not prevail in phylogenetics, as we discuss in further detail in the case study paragraph.

Checking PC and PI in Polynomial Time
Existing supertree methods can sometimes output trees that do not satisfy PC or PI. For instance, the MC supertree obtained for the collection of Figure 2 does not satisfy PC, whereas on that of Example 4, it fails to satisfy PI. In contrast, for the collection {AB|C,BC|D} the MC supertree satisfies both PI and PC. The MRP method sometimes outputs supertrees not satisying these properties (e.g., PC is not satisfied in Figure 1 of Bininda-Emonds and Bryant, 1998) and sometimes provides supertrees that satisfy them—e.g., when the source trees are compatible (Steel, 1992). We now describe an algorithm to decide whether a candidate supertree satisfies both PI and PC together. In case of a negative answer, it pinpoints those parts of the supertree contradicting these properties. This algorithm relies on two properties equivalent to PC and PI, whose formulation is less intuitive but whose checking is easy.

Definition 1 Let {tau} be the collection of source trees and T be a proposed supertree for {tau}. Define PCeq and PIeq to be the following properties:

  • PCeq: rt(T){cup} R(T,{tau}) is compatible.
  • PIeq: for any tisin rt(T) and for all Formula , the set { Formula }{cup} RT,{tau} is incompatible.

Proposition 2 (PIeq and PCeq) {Leftrightarrow} (PI and PC).

proof.

  • PCeq {Rightarrow} PC: PCeq {Rightarrow} {forall} tisin rt(T), {t}{cup} R(T,{tau}) is compatible. This ensures that there is at least one tree T' that displays R(T,{tau}){cup} {t}. It follows that T' displays R(T,{tau}) but not Formula . As Formula is not displayed by every tree that displays the compatible set R(T,{tau}), it follows that R(T,{tau})Formula .
  • PC {Rightarrow} PCeq: PC {Rightarrow} R(T,{tau}){subseteq} rt(T) (cf. proof of Proposition 1); this ensures that rt(T){cup} R(T,{tau}) is compatible (since displayed by T).
  • PI + PC {Rightarrow} PIeq: PI and PC {Rightarrow} cl(RT,{tau}) = rt(T) by Proposition 1. This ensures PIeq.
  • PIeq + PCeq {Rightarrow} PI: PCeq ensures that R(T,{tau}) is compatible. PIeq is exactly the definition of the induction for a compatible set, thus ensuring PI.

Note that we do not prove a direct equivalence between PI and PIeq in this general case. The two properties are only equivalent for a compatible set. In fact, PIeq is relatively uninformative without PCeq, since PIeq holds as soon as R(T,{tau}) is incompatible. Note also that another formulation of PC, closer to that of PIeq but less concise, is as follows: for anytisin rt(T), the set {t}{cup} R(T,{tau}) is compatible.

PIeq and PCeq can easily be checked by using the Build algorithm, which indicates in polynomial time whether a set of rooted trees is compatible or not. A similar procedure was proposed by Steel et al., (1992), and refined by Daniel (2004), to compute the strict consensus of all supertrees displaying a collection of compatible source trees. The following lemma provides us with an even faster way to check PCeq.

Definition 2 (Direct contradiction) A tree Tdirectly contradicts a set of triplets R when there is a triplet t in rt(T) such that exist Formula isin R. A supertree T is said to directly contradict a collection {tau} of source trees if T directly contradicts rt({tau}).

Direct contradictions are linked with the PC property in the following way:

Lemma 1 If a tree T does not directly contradict a collection {tau} of source trees then the three following statements hold:

  1. R(T,{tau}){subseteq} rt(T);
  2. R(T,{tau}) is compatible;
  3. T satisfies PCeq for {tau}.

proof By definition, R(T,{tau}) only contains triplets on three-taxon sets for which there is a triplet in rt(T). Because T does not directly contradict {tau}, the triplets of R(T,{tau}) are resolved as those in T. It follows that R(T,{tau}){subseteq} rt(T) (proving 1). R(T,{tau}) is therefore compatible (proving 2). Moreover, R(T,{tau}){subseteq} rt(T) ensures that R(T,{tau}){cup} rt(T) is compatible, which is exactly the formulation of PCeq (proving 3).

Thus, to check that a supertree T satisfies PCeq, and hence PC, for a collection {tau}, it suffices to check that any triplet of rt(T) is not resolved in a different way in a tree of {tau}. This can be done by computing the set rt(T) of O(n3) triplets in T and then comparing rt(T) with the set of triplets of each source tree Ti. If the collection {tau} contains k source trees and a total of n taxa, then this simple implementation requires O(kn3) computing time. However, it is possible to check this condition in linear time for each pair T,Ti with Tiisin{tau}: first restrict in O(n) time the trees T and Ti to the taxa they share; then apply the algorithm of Berry et al. (2005) that, given two trees with the same taxa, finds in O(n) time a triplet resolved differently in the trees, or states that this situation does not arise. Thus, successively considering k source trees leads to a procedure that checks PC in O(kn) computing time.


    PhySIC: A Polynomial-Time Veto Supertree Method
 Top
 Abstract
 Non-Contradiction and Induction...
 PhySIC: A Polynomial-Time Veto...
 Biological Case Studies on...
 Conclusion
 Appendix 1
 References
 
We introduced above the PI and PC properties, showed their relevance, and described algorithms to check whether a given supertree T satisfies them. In this section, we show that it is possible to design a method that always produces supertrees that satisfy PI and PC. However, this aim is not precise enough, as the star tree (the tree whose leaves are all children of a single internal node) trivially satisfies these properties—simply because it does not resolve any triplet. Thus, a reasonable aim is to design a method that always infers supertrees that satisfy PI and PC and that contain as much resolution as possible; e.g., resolve as many triplets as possible. More precisely, we require a method that, given any collection {tau}, proposes a supertree T such that R(T,{tau}) identifies T and R(T,{tau}) has maximum size over all such subsets of rt({tau}). Such a subset of rt({tau}) is called a maximum identifying subset of triplets (MIST).

The difficulty of this problem cannot be simply deduced from previously known theoretical results for optimization problems on triplets. Indeed, the MIST problem is a middle term between the NP-hard problem that consists of finding a maximum-sized compatible subset of triplets (Bryant, 1997) and the polynomial-time problem that asks for the maximum-sized tree-like subset of a complete set of triplets (Berry and Gascuel, 2000; Bryant and Berry, 2001). Unfortunately, the MIST problem is NP-hard (Guillemot and Berry, 2007). This shows that it is highly unlikely that a polynomial-time algorithm exists that could find the most resolved supertree satisfying PI and PC. However, we can still rely on heuristic algorithms to find reasonable (but potentially suboptimal) solutions, as is commonly done for other NP-hard problems such as finding a most parsimonious tree or a maximum likelihood tree for a character matrix.

We present below a polynomial-time heuristic method that always outputs a supertree that satisfies PI and PC. The method tries to produce a supertree that contains as many input triplets as possible under this constraint. The method is a variant of the well-known Build algorithm and is called PhySICPhylogenetic Signal with Induction and non-Contradiction. Supertrees inferred by the method have a degree of resolution that can be close to that of supertrees inferred by voting methods (see next section) while only containing clades that are not arbitrary with respect to the source trees nor contradicting them as detected by PC.

Inferring a Supertree that Satisfies PC
This section introduces algorithms, based on the Build algorithm, to produce nontrivial trees that satisfy PC.

The Build Algorithm.—
The Build algorithm is a yes-or-no algorithm that tells whether a collection of triplets or larger trees is compatible or not. To achieve its goal, the algorithm tries to build a tree displaying the triplets; if the process is blocked at some step, this means that the input triplets are not compatible. This tree is built recursively, from the root to the leaves. First, the largest clades are identified, then clades included in the first ones, and so on. The composition of the clades is guided by the structure of a graph; that is, a set of objects (called vertices) with links (called edges) between pairs of them.

The graph used by Build, called here the Aho Graph, is defined as follows: let R be a collection of triplets on a taxon set X, the Aho GraphG for R is the undirected graph with vertices X and with edges (A,B) between two taxa A and B whenever there is a triplet AB|Cisin R. Thus, an edge between two taxa means that at least one triplet sees these two taxa in the same clade. The vertices of G are denoted by v(G) (in the present description v(G) = X). For instance, Figure 7a shows the Aho graph built from R = rt({T1,T2}), where T1,T2 are the source trees of Figure 2 and, e.g., the edge between taxa B and D is due to the triplet bd|cisinrt(T2).


Figure 7
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 7 Examples of graphs. (a) The initial Aho graph G created from the triplets rt({tau}) of the collection {tau} displayed in Figure 2. The two connected components of G are C1 = {E,F} and C2 = {A,B,C,D}. (b) The Aho graph obtained from R|v(C2). This graph is connected, showing that the input trees conflict on the resolution of {A,B,C,D} and hence are incompatible. (c) The Aho graph obtained from R|v(C2) when removing the triplets Rdc = {AB|C,AC|B}.

 
A connected componentCi of a graph is a maximal set of taxa linked to one another; i.e., such that for any pair A,B of taxa in Ci, there is a set of edges that links A to B. For instance, the graph in Figure 7a contains two connected components: C1 = {E,F} and C2 = {A,B,C,D}. The connected components of graph G are denoted by CC(G). The vertices of a component Ci of G are denoted by v(Ci). When the Aho graph contains several connected components, each of them corresponds to a clade of the tree representing the input collection of triplets (if such a tree exists). Once these clades are known, the clades contained in each of these primary clades are found by recursively processing Aho graphs for subsets of triplets that respectively concern the taxa of these clades: the restriction of R to taxa of a component Ci is denoted by R|v(Ci) and defined as { AB|Cisin Rsuch that{A,B,C}{subseteq} v(Ci)}. For example, Figure 7b shows the Aho graph obtained from R|v(C2) where R is the set of triplets due to source trees in Figure 2, and C2 is the component of the initial Aho graph shown in Figure 7a. The recursive calls stop when dealing with components containing less than 3 taxa, because there is no triplet (hence incompatibility) on so few taxa. However, if at some point in the recursive process, the Aho graph for a set of at least three taxa has only one connected component, this means that the input trees are conflicting on the resolution of these taxa. When this happens, the algorithm states that the collection of source trees is incompatible. Otherwise, when all recursive calls return, the algorithm concludes that the source trees are compatible. For instance, when run on the collection of Figure 2, Build first finds two connected components, C1 = {E,F} and C2 = {A,B,C,D}, but the recursive call on C2 leads to a graph containing only one connected component (Figure 7b), which leads the algorithm to detect the incompatibility of the source trees.

A First Simple Modification of Build.—
We first describe here a simple modification of Build that infers a supertree from a collection of source trees {tau}. This subroutine, called BuildPC (Figure 8), takes as input the triplet set R = rt({tau}) of a collection {tau} of source trees and the list S of taxa contained in these trees. BuildPC mainly differs from Build when the Aho graph contains one connected component on the set S of taxa currently considered (line 1). In this case, BuildPC returns the star tree on S (i.e., a single polytomy on S, thus contradicting no input triplet), whereas Build simply concludes that the sources trees are incompatible. This star tree is then grafted as a subtree of the tree built by the previous recursive call. Thus, we can now output a supertree even when the source trees are incompatible. As an example, from the collection of Figure 2, BuildPC infers the supertree T' displayed in Figure 3.


Figure 8
View larger version (64K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 8 Details of the BuildPC subroutine taking a set S of taxa and a set R of triplets on S as input.

 
Proposition 3 Given a collection R of triplets on a taxon set S, BuildPC returns a tree T on S that satisfies the PC property for R.

The proof of this proposition can be found in Appendix 1.

A More Involved Algorithm to Infer a Supertree Satisfying PC.—
BuildPC sometimes produces poorly resolved trees due to multifurcations returned in cases where G contains a single connected component (i.e., when R contains conflicts covering the considered subset of taxa). In the most extreme (though unlikely) case, this situation occurs at the first step of the algorithm, which then outputs a star tree.

The most basic conflicts between triplets of R occur when two different triplets t and Formula appear in R for a same set of three taxa. Such a direct contradiction cannot be present in a tree that satisfies PC. Given Rdc, the set of triplets such that t,Formula isinR it seems relevant to consider the subset R' = RRdc. We define a variant of BuildPC, called PhySICPC, that resorts to that subset whenever conflicts are detected. This enables the produced supertree T' to be generally much more resolved than the tree returned by BuildPC. For instance, Figure 7b shows the graph obtained for R|v(C2), where R are triplets of the collection in Figure 2 and C2 is the connected component shown in Figure 7a. This graph is connected due to the direct conflicts between AB|C (resulting from T1) and BC|A (resulting from T2). This situation leads BuildPC to return a polytomy on A, B, C, D. In contrast, building the graph on the basis of R' results in two connected components, Ci and Cj, allowing PhySICPC to propose a tree with two subtrees for taxa A,B,C,D.

The correctness of BuildPC ensures that T' satisfies PC with respect to R' but without any guarantee that this also holds with respect to R. To ensure the latter, and thus the correctness of PhySICPC, T' must not resolve any triplet of Rdc. A way to ensure this is to collapse any branch of T' that resolves a triplet of Rdc. The tree thus obtained is still always at least as resolved as the one proposed by BuildPC and potentially contains supplementary branches. Indeed, direct contradictions at the root of a clade no longer prevent the proposition of clades on subsets of its taxa. For instance, on the collection of Figure 2, the tree initially computed by PhySICPC is the tree called T in Figure 3. But as the branch leading to the clade (A,D,B) contradicts AC|BisinRdc, the branch above this clade is collapsed, and the final tree output by PhySICPC is then the tree named T" in Figure 3. This tree contains one clade more than the tree output by BuildPC (the tree named T' in the figure).

These ideas are included in the PhySICPC algorithm whose pseudo-code is detailed in Figure 9.


Figure 9
View larger version (118K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 9 Details of the PhySICPC subroutine taking a set S of taxa and a set R of triplets on S as input.

 
Theorem 1 Given a triplet set R = rt({tau}) from a collection {tau} of source trees on a taxon set S of n leaves, PhySICPC returns in O(n4) time a tree T satisfying PC for {tau}.

The proof of this result can be found in Appendix 1.

Ensuring that the Supertree Satisfies PI
The supertree TPC output by PhySICPC does not usually satisfy the PI property. The PhySICPI algorithm transforms TPC so that it also satisfies PI. To that aim PhySICPI recursively searches the tree and checks that for each branch each triplet is induced by R(TPC,{tau}). Theorem 3.1.1 of Daniel (2004) provides a useful characterization to decide when a branch is justified, directly or indirectly, thanks to triplets present in R(TPC,{tau}). When considering the branch linking a node u to a child subtree Si, the theorem considers a graph Gij for any sibling subtree Sj of Si. Any such graph Gij is the Aho graph with vertices L(Si), and with edges due to triplets of R(TPC,{tau}) whose three leaves are in L(Si){cup} L(Sj). The theorem states that the branch from u to the root of Si is justified if and only if Gij is connected, for any sibling subtree Sj.

Example 5 Consider for instance the simple example where {tau} contains the trees ((A,B),X) and ((E,F),X). The Aho graph for rt({tau}) = {AB|X,EF|X} is made of three connected components: C1 = {A,B}, C2 = {E,F}, and C3 = {X}; therefore, applying the PhySICPC algorithm gives the tree TPC = ((A,B),(E,F),X). TPC displays AB|E even though this information is not induced by {tau}. Indeed, the branch defining the clade (A,B) is detected as not justified since the corresponding connected component, C1, is not connected in the Aho graph when we consider only edges due to triplets with taxa in C1{cup} C2.

This theorem is the basis of a decision algorithm called Identifies that states whether a given set of triplets identifies a given tree (Daniel, 2004). It is possible to design a simple variant of this algorithm that always returns a tree (not just a yes or no answer): when a branch between a node p and the root of a subtree Si is not justified, the idea is to replace Si by a star tree on the taxa of the corresponding clade. This crude variant removes the unjustified branches, but also potentially many other branches; i.e., those inside Si, those leading to sibling subtrees Sj of Si, and those inside Sj subtrees. PhySICPI is a more refined variant that only collapses the unjustified branches. See the pseudo-code in Figure 10 for details. In this code, PhySICPI is given a tree T in which unjustified branches are to be collapsed and a collection {tau} of source trees or, equivalently, the corresponding set of triplets (as written in the pseudo-code). PhySICPI repeatedly calls the CheckPI subroutine to detect unjustified branches that are then removed until none remains (note that in the pseudo-code of CheckPI, S(T) denotes (complete) subtrees connected to the root of T; i.e., the subtrees corresponding to the largest clades under the root of T).)


Figure 10
View larger version (131K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 10 Details of the PhySIC algorithm and PhySICPI and CheckPI subroutines.

 
From the collection of Figure 2, PhySICPC infers the supertree T" displayed in Figure 3 and none of the three internal branches of T" are collapsed by CheckPI. For instance, consider the step where CheckPI checks the subtree ((A,D),B,C) of T", whose child subtrees are (A,D) plus the two trivial subtrees on B and C. The sole branch that has to be checked in ((A,D),B,C) is the one defining the clade (A,D). Here, CheckPI builds two Aho graphs with vertices {A,D}: one with edges due to triplets on {A,D}{cup} {B} and one with edges due to triplets on {A,D}{cup} {C}. Both graphs are connected thanks to triplets of the source tree T2; therefore, CheckPI does not collapse any branch at this step.

Theorem 2 Given a collection {tau} of trees and a tree satisfying PC for {tau}, PhySICPI returns in O(n4) time a tree T on L({tau}) that satisfies both PC and PI for {tau}.

The proof of this Theorem can be found in the appendix.

The PhySIC algorithm (see pseudo-code) builds a supertree for a collection of k source trees {tau} by first computing the set rt({tau}) and then successively calling PhySICPC and PhySICPI. Since rt({tau}) is computed in O(kn3), PhySIC runs in O(kn3+n4) time.

Theorem 3 Given a collection {tau} of k source trees on n leaves, PhySIC returns in O(kn3+n4) time a tree satisfying both PC and PI.

Lastly, we note that similar procedures can be designed to modify the supertree proposed by any existing supertree method. If a method proposes a supertree T that does not satisfy PC and PI, it is possible in polynomial time to transform T into a tree T' that satisfies these properties. Indeed, the algorithm indicated previously to check PC indicates the triplets from which the incompatibility arises. Then the branches of T inducing these triplets can be collapsed to obtain a tree T' satisfying PC. Now, PhySICPI can be applied to T' to ensure that it also satisfies PI (without invalidating PC).


    Biological Case Studies on Primates
 Top
 Abstract
 Non-Contradiction and Induction...
 PhySIC: A Polynomial-Time Veto...
 Biological Case Studies on...
 Conclusion
 Appendix 1
 References
 
To illustrate the impact of the PC and PI properties on supertree inference, and to compare the behavior of veto methods like PhySIC to that of voting methods like MRP and MMC, we present two case studies centered on Primates. This mammalian order is one of the first taxonomic groups for which a large-scale supertree approach has been conducted (Purvis, 1995a). The first example is designed to show the desirable properties of PhySIC compared to other supertree methods on a smaller, understandable taxonomic scale. The second example addressing the question of the primate supertree at the genus level shows how PhySIC performs on a larger taxonomic scale—approaching what supertree studies tend to be performed on—and shows that varying degrees of resolution are achieved in the supertree depending upon the nodes retained from the input trees.

First Example: Illustration of Supertree Desirable Properties
Source Trees.—We focused on a subsample of Primates IRBP (interphotoreceptor retinoid binding protein) and ADRA2B ({alpha}2B-adrenergic receptor) gene sequences, respectively, from Poux and Douzery (2004) and Poux et al. (2006), with a rodent (Mus) and lagomorph (Oryctolagus) outgroup. For ADRA2B, the hominoid representative was Pan, with the sequence downloaded from the chimp ENSEMBL project. The ADRA2B and IRBP source trees were inferred by maximum likelihood (ML) analysis of the corresponding alignments, using PHYML (Guindon and Gascuel, 2003), version 2.4.4, under a GTR+{Gamma}4+INV model of DNA evolution. The node support was estimated after 1000 bootstrap replicates using the same software and expressed as bootstrap percentages (BP). Denser taxonomic and phylogenetic information for Strepsirrhines (i.e., Lemurs and Galagos) was sought from a study of presence-absence of short interspersed nuclear elements (SINE) integrations in primate genomes (Roos et al., 2004: fig. 2). Sixty-one monolocus SINE characters detected by these authors were subjected to a maximum parsimony analysis using PAUP* (Swofford, 2002), version 4b10, with 1000 bootstrap replicates using a heuristic search, with 10 random additions of taxa, and TBR branch swapping. We only retained the best supported nodes of source trees; i.e., those showing at least 50% bootstrap (cf. also Daubin et al., 2002).

Comparison of Supertrees Inferred from PhySIC, MMC, and MRP
Starting from the three source trees (Figure 11), supertrees were built using the MMC, MRP, and PhySIC methods. For MRP, the matrix representation of the three source topologies resulted in 47 characters. Parsimony analysis was conducted under PAUP*, with a heuristic search with 1000 random addition sequences, and TBR branch swapping, resulting in 864 equally parsimonious trees, a strict consensus of which provided the MRP supertree. The MMC supertree was obtained using the program distributed by Rod Page. Figure 12 shows the supertrees respectively reconstructed by MMC, MRP, and PhySIC with its PhySICPC intermediate step.


Figure 11
View larger version (116K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 11 Three source trees for the case study on primates. Majority rule consensus source trees derived from the bootstrap analysis of ADRA2B and IRBP sequences (in maximum likelihood) and SINE characters (in maximum parsimony). Bootstrap percentages are indicated on nodes, and only nodes defined by more than 50% are considered. Thin branches lead to the outgroup. Taxa in bold are handled differently by the different supertree algorithms and illustrate three different situations (arrows, letters, box: see main text). The two Tarsius species are T. bancanus (b) and T. syrichta (s).

 


Figure 12
View larger version (211K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 12 Comparison of supertrees inferred by three methods: MMC, MRP, and PhySIC (the PhySICPC intermediate step of the latter is also displayed). Thin branches lead to the outgroup. Taxa in bold are handled differently by the different supertree algorithms and illustrate three different situations. (i) Arrows indicate the surprising positioning of Homo and Pan under the MMC and PhySICPC algorithms; (ii) A-B-C-X letters correspond to taxa arbitrarily grouped by MMC and PhySICPC (cf. Figure 2); (iii) boxes contain platyrrhine taxa for which MMC and MRP contradict the IRBP source topology. The P label on MMC and PhySICPC supertrees refers to the polytomy involving catarrhine and platyrrhine clades. The taxonomic frame for Primates is given on the PhySIC supertree. Hatched rectangles represent Anthropoidea (Catarrhini + Platyrrhini). White and black rectangles respectively represent Haplorrhini (Tarsiiformes + Anthropoidea) and Strepsirrhini (Lorisiformes + Lemuriformes).

 
The supertrees produced all contain some soft polytomies, each of them representing uncertainty about the resolution of a node's child subtrees or lineages. A soft polytomy can have two distinct interpretations, differing in the set of admissible fully resolved phylogenies it encompasses. Consider the case of the polytomous node P in the MMC tree of Figure 12. This node has three child subtrees S1 = (Homo, Hylobates), S2 = (Pan,(Cercopithecus, Macaca)), and S3, the Platyrrhinii clade. The most widespread meaning of a soft polytomy accepts any fully resolved tree on subtrees S1,S2,S3 that keeps their monophyly: ((S1,S2),S3), ((S1,S3),S2), or ((S2,S3),S1). Strict consensus, majority-rule consensus, and hence MRP, interpret polytomies in this way (Margush and McMorris, 1981). Polytomies proposed by PhySICPC are also to be interpreted in this way. A second interpretation of soft polytomies was introduced by the Adams consensus (Adams, 1972) and is also intended by MC (Semple and Steel, 2000) and MMC (Page, 2002). This interpretation accepts as possible phylogeny any fully resolved tree that maintains the structure of each subtree respectively, no matter whether or not S1,S2, and/or S3 are kept monophyletic (i.e., their leaves can be interleaved). Thus, the polytomy P of the MMC tree in Figure 12 can indeed give rise to fully resolved trees grouping Pan and Homo without Hylobates, as long as Pan is kept outside the clades containing Cercopithecus and Macaca (which is the structure imposed by S2). Under this interpretation, a soft polytomy represents a much wider range of fully resolved phylogenies than with the first interpretation, and is harder to interprete in a phylogenetic context. (In particular, this means that simulation studies on supertree methods that use the Robinson and Foulds distance to evaluate the performance of MC or MMC are misleading: on the previous example, the MMC method would have been considered to propose the incorrect clades Homo + Hylobates, and Pan + Hylobates.)

The contribution of PhySICPC to the supertree inference may be illustrated by the situation among platyrrhines. Here, ADRA2B indicates that Ateles is the sister-group of Pithecia, Callithrix, and Cebus, whereas IRBP indicates that Ateles and Cebus are the closest relatives (Figure 11: boxed areas). This conflict is detected by PhySICPC. As a result, the PhySICPC and PhySIC supertrees display all four platyrrhines within a multifurcation. By contrast, MRP and MMC give priority to the Callithrix + Cebus grouping present in the ADRA2B source tree, and thus contradicts the Ateles + Cebus grouping present in the IRBP source tree. Resolution of this conflict between the source trees reflects the voting approach followed by MRP and MMC. For instance, consider the case of MRP: the ADRA2B source topology comprises two nodes within platyrrhines, against one node for the IRBP topology. Therefore, MRP favors the node Callithrix + Cebus involved in a topological conflict but belonging to a larger and more resolved clade (Bininda-Emonds and Bryant, 1998). The behavior of MRP and MMC on platyrrhines is problematic. Indeed, it favors one source topology while contradicting another, just on the basis of their respective levels of resolution, and despite the fact that both contain the same number of taxa for the platyrrhine subtree. MRP has already been criticized on this point (e.g., Goloboff, 2005). Note that source trees also conflict on the position of Propithecus with respect to Microcebus and Lemur. However, in this case, MRP behaves as PhySICPC and PhySIC; i.e., displays a polytomy on groups containing these three taxa (Figure 12: letters A-B-C). By contrast, MMC groups together the Propithecus and Lemur clades, following the SINE information, but contradicting the IRBP information.

The complementary contribution of PhySICPI to the supertree inference may be illustrated by the situation among Catarrhines + Platyrrhines. Although not contradicting the source trees, the PhySICPC supertree contains two topological errors. First, man and chimp do not group together relative to the gibbon, as would be expected from a plethora of data (Goodman et al., 2005). Homo is instead associated with Hylobates, whereas Pan branches with the two cercopithecoids, Cercopithecus and Macaca. This situation results from the taxon sampling of the source topologies. More precisely, man and chimp are not simultaneously present in any source tree; i.e., the former clusters with the gibbon (IRBP) and the latter with cercopithecoids (ADRA2B). These two source clades are reproduced in PhySICPC and MMC supertrees.

In the case of PhySICPC, these two clades are involved in a polytomy with the platyrrhines. This polytomy means that (Homo, Hylobates) is a sister clade of the clade containing Pan. However, although these clades are correct when considered separately, they should not be sister groups in the supertree. PhySICPI detects this situation of arbitrary resolution and collapses the corresponding branches, thus the final PhySIC supertree allows for a group (Pan, Homo). MMC displays the same polytomy as PhySICPC but with a different meaning: the interleaving interpretation of this soft polytomy means that MMC does not reject the expected resolution, namely grouping (Pan, Homo) as a sister clade of Hylobates. In conclusion, both MMC and PhySIC allows for the expected group (Pan, Homo), but note that the PhySIC supertree is more accurate, as its polytomy does not allow the catarrhine taxa Homo, Hylobates, or Pan to branch within the platyrrhines. Here, MRP does not introduce arbitrary resolutions and proposes a polytomy involving the five catarrhine taxa.

Another problem of MMC and PhySICPC supertrees is that Lepilemur is the sister-group of all Lemuriformes but Daubentonia, whereas this topological information is not present in the only source tree (SINEs) for which Lepilemur is scored. This result is explained by the fact that the restriction of IRBP and ADRA2B source topologies to taxa lettered A-B-C-X leads to the situation described on Figure 6. Thanks to the PI property, the PhySIC algorithm again corrects this problem and displays a polytomy involving the major clades of lemuriformes, together with Lepilemur (Figure 12). The same polytomy is also proposed by MRP. Overall, this first case study illustrates that the two properties introduced in the present work help to identify and manage the potential arbitrary and conflicting resolutions arising in supertrees when combining independent source topologies.

Second Example: A PhySIC Supertree of Primate Genera
Primary Data and Source Tree Inference
We used 24 data sets to reconstruct the primate phylogeny: two mitochondrial DNA (mtDNA), 19 nuclear DNA, and three transposable elements data sets. All sequences used in this study were retrieved from EMBL-Genbank databases. The sampling of genes and other molecular markers is detailed in Table 1. The corresponding data are available under TreeBASE accession numbers S1879 and M3455. This combined data set encompasses 95% of all primate extant genera (Wilson and Reeder, 2005); i.e., 66 genera. Two subfossil genera from ancient DNA analyses were also included (Karanth et al., 2005). All genes were aligned with Clustal X (Thompson et al., 1997) with subsequent manual refinement. We used Mus and Rattus as outgroups in all analyses for which sequence data was available. Each gene was analyzed with the ML criterion under the best fitting model (Table 1). Separate ML phylogenetic reconstructions and bootstrap analyses were performed with PHYML (Guindon and Gascuel, 2003) as described in previous section. Maximum parsimony phylogenetic reconstruction and bootstrap analysis on the three transposable element data sets were also conducted using PAUP* as described in previous section. Clades of the source trees with BP values above a specified threshold were retained. To evaluate the influence of this parameter, five PhySIC supertrees were inferred by respectively considering BP ≥ 50%, 60%, 70%, 80%, and 90%. Each run of PhySIC took less than 4 s on an Intel MacBook.


View this table:
[in this window]
[in a new window]

 
Table 1. Molecular markers used to infer the primate source trees. The best-fitting model and the number of primate genera per marker are indicated. (mtDNA: mitochondrial DNA)

 
The Major Clades of Primate Genera
The most resolved supertree reconstructed by PhySIC is obtained when source trees were restricted to nodes supported by more than 70% bootstrap (Figure 13; BP ≥ 70%). This topology conforms to current ideas on primate phylogeny and is close to the informal supertree of Primates at the genus level proposed by Goodman et al. (2005). In addition, we here extend their taxon sampling with the three extant genera Euoticus, Piliocolobus, and Simias, and the two subfossil genera Megaladapis and Paleopropithecus. Our supertree displays the fundamental dichotomy among Primates between Strepsirrhini and Haplorrhini. Strepsirrhines then split into Lorisiformes (Lorises and Galagos) and Lemuriformes (lemurs and Daubentonia, the aye-aye). Haplorrhines also split into Tarsiers and Anthropoids. The latter clade subsequently divides into monophyletic New World primates (Platyrrhini) and Old World primates (Catarrhini). Platyrrhini display a trifurcation involving the three families Atelidae (the Ateles + Alouatta clade), Pitheciidae (the Pithecia + Callicebus clade), and Cebidae (the Cebus + Saimiri + Aotus + Saguinus clade). Catarrhines split into Hominoidea (gibbons and apes) and Cercopithecoidea (colobines and cercopithecines).


Figure 13
View larger version (210K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 13 Primate PhySIC supertree including 95% of all extant genera and containing no contradiction or arbitrary resolution with respect to the source trees, as defined by the PI and PC properties. The genus-level primate supertree has been reconstructed from 24 molecular source trees restricted to nodes supported by more than 70% bootstrap. When the bootstrap threshold is increased to 80%—respectively, 90%—the supertree topology changes: disappearing branches as well as one appearing clade (Papio + Theropithecus) are indicated by white—respectively, black—stars. Polytomies are labeled by tags pointing out the properties (PI, PC, or both) that would not be satisfied if the corresponding clade was more resolved. The taxonomic frame and clade names for Primates are given. Hatched rectangles represent Anthropoidea (Platyrrhini + Catarrhini). White and black rectangles respectively represent Haplorrhini (Tarsiiformes + Anthropoidea) and Strepsirrhini (Lorisiformes + Lemuriformes).

 
Identifying and Labeling the Causes of Supertree Polytomies
Because veto methods are used for evaluating the topological congruence of source trees, and for measuring their degree of leaf overlap, the PhySIC program outputs labels on each polytomous node. A label "C" (standing for Contradiction) indicates that the polytomy results from contradictions among the source trees on phylogenetic relationships of corresponding taxa: proposing a resolution for the polytomy would contradict at least one source tree; i.e., would not respect the PC property. A label "I" (standing for Induction) indicates a lack of cross-information in the source trees: any dichotomous resolution of the clade would be at least partially arbitrary and thus would not respect the PI property. Note that a given label applies only to the node to which it is assigned but not to other nodes in its subtrees. For instance, in the primate genera supertree (Figure 13), the platyrrhine trifurcation (Atelidae, Pitheciidae, Cebidae) with a C label indicates that there is topological contradiction among the source trees about the sister-group relationships of these three families. However, the C label does not put the monophyly of Atelidae, Pitheciidae, and Cebidae into question. Note also that a same polytomy can be characterized by both C and I labels. This means that the inability of the supertree to propose a dichotomous resolution is partly due to a lack of taxonomic overlap, and partly due to contradictions. For example, Figure 13 shows that the clade Cercopithecus, Erythrocebus, Chlorocebus, and Miopithecus is tagged by both C and I, reflecting two problems. On the one hand, source topologies disagree about the placement of Erythrocebus: this genus is either related to Cercopithecus (as suggested by IRBP exon 1) or to Chlorocebus (cf. the TSPY and chromosome Xq13.3 markers, and the Alu characters of Xing et al. [2005]). On the other hand, the input trees analyzed here do not provide the information required to know whether Miopithecus is the sister group of Cercopithecus, or is that of Erythrocebus + Chlorocebus, or is the most basal genus in the clade.

Impact of the Robustness of Source Trees on Veto Supertree Resolution
The number of clades retained from the original source trees depends on the bootstrap threshold imposed to select them for supertree inference. Choosing a low threshold thus increases the number of retained source clades and hence lowers the number of polytomies due to a lack of cross-information among source trees but increases the number of polytomies due to conflicts among source trees. Increasing the threshold has the opposite effect. The primate supertree of Figure 13 was obtained with BP ≥ 70%. Lowering the threshold to BP ≥ 50% or BP ≥ 60%, PhySIC yields a completely multifurcating supertree, due to weakly supported clades that conflict among source trees. When the bootstrap stringency is increased from the BP ≥ 70% to BP ≥ 80% threshold, a similar level of resolution in the genus level phylogeny is obtained with the exception of two additional polytomies: the first involves Indriidae (Indri + Avahi + Propithecus + Paleopropithecus) relative to other lemuriformes, and the second involves Allenopithecus relative to the Cercopithecus clade (white stars in Figure 13 refer to disappearing branches). Interestingly, increasing the threshold removes a topological conflict among Lophocebus, Papio and Theropithecus: with the PC property being satisfied, then the PhySIC supertree groups together the latter two genera. At the BP ≥ 90% threshold, 7 additional polytomies with respect to the BP ≥ 70% topology appear (Figure 13: black stars refer to node collapsing). This reflects the fact that less source nodes (i.e., the nodes of source trees) are available for supertree inference. The PI property is thus less often satisfied in the PhySICPC supertree, leading to a greater number of irresolutions in the PhySIC supertree.

Overall, two reasons can lead the PhySIC method to propose a poorly resolved supertree. First, it is possible that the source trees contain too little cross-information for the method to decide how the taxa of the respective source trees branch relatively to each other. In this case, all methods, including voting methods, will produce unresolved supertrees. Obtaining more resolved supertrees can then only be achieved by adding new source trees containing new clades on the key taxa. The second reason why the PhySIC supertree can lack resolution is the presence of topological conflicts among source trees. Like other veto methods, PhySIC is very sensitive to incongruences in the source trees. Thus, to obtain a well-resolved tree, a preliminary process whereby unreliable clades are collapsed in the source trees is usually necessary before applying the method. This collapsing can be done on the basis of the support values provided on the clades by most phylogenetic inference methods (e.g., bootstrap values, Bayesian posterior probabilities, Bremer support). We showed that a well-resolved supertree of Primates can be obtained with such an approach from a nontrivial number of gene trees. Note that on some data sets, contradicting clades showing high support values can occur, e.g., due to lateral gene transfers. In such cases, veto methods will still produce unresolved supertrees (as long as they are not allowed to exclude rogue taxa). This can be seen as a drawback or as a way to pinpoint such events. In such cases, outlier source trees can be identified (Shimodaira and Hasegawa, 1999; Lerat et al., 2003) and then curated or removed from the collection of source trees, leading to a more resolved supertree.


    Conclusion
 Top
 Abstract
 Non-Contradiction and Induction...
 PhySIC: A Polynomial-Time Veto...
 Biological Case Studies on...
 Conclusion
 Appendix 1
 References
 
Veto supertree methods are of interest for combining source topologies containing reliable clades. Their study also brings insight for the characterization of what we expect from voting methods. Indeed, when source trees are not conflicting, there is no fundamental difference between the two approaches. In such cases, veto and voting approaches should lead to reasonable supertrees. What reasonable means can be characterized by several formal properties. In the present work, we showed pitfalls of some previously proposed supertree properties, and also proposed new properties. In the general case of conflicting source trees, we believe there is still room for improvement, e.g., detecting arbitrary clades of a supertree even when it partially conflicts with some source trees, as usually happens in the voting context. With the new theoretical material at hand we believe that this is a reasonable goal.


    Appendix 1
 Top
 Abstract
 Non-Contradiction and Induction...
 PhySIC: A Polynomial-Time Veto...
 Biological Case Studies on...
 Conclusion
 Appendix 1
 References
 
Proof of Proposition 1
The proof is based on the fact that a set R identifies a tree T if and only if rt(T) = cl(R) (Grunewald et al., 2006, lemma 2.1).

  • By definition, R(T,{tau}) only contains triplets on three-taxon sets for which there is a resolution in T. Moreover, PC ensures that any such triplet cannot be resolved in R(T,{tau}) differently than that in T. It follows that R(T,{tau}){subseteq} rt(T). R(T,{tau}) is therefore compatible and cl(RT,{tau}){subseteq} cl(rt(T)) = rt(T). Meanwhile, having proved that R(T,{tau}) is compatible, it is clear from PI that cl(RT,{tau}) ⊇ rt(T).
  • Having R(T,{tau}) identifying T ensures that R(T,{tau}) is compatible. On one hand, cl(RT,{tau}) = rt(T) implies that for all tisin rt(T), tisin cl(RT,{tau}); i.e., R(T,{tau}){vdash} t and therefore PI holds. On the other hand, given tisin rt(T), if R(T,{tau}){vdash} Formula then, by definition of the closure, Formula isin cl(R(T,{tau}) = rt(T) so that both t and Formula are in rt(T), which is not possible. This proves that R(T,{tau})Formula Formula for any tisin rt(T); i.e., PC holds.

Proof of Proposition 3
Let AB|Cisinrt(T) be a triplet of the output supertree T and consider the recursive step where the tree returned in line 3 hosts A,B in the same subtree Ti, whereas C is in another subtree, say Tj. This means that A,B are vertices in a connected component Ci of the current Aho graph G, while C is in another component Cj of CC(G). If R contained AC|B or BC|A, then Ci and Cj would not have been distinct connected components (in such cases, graph G would have contained the edge (A,C) or (B,C)). Thus, for any triplet t of rt(T), we know that no triplet Formula is in R; i.e., T does not directly contradict R. This ensures that BuildPC returns a tree satisfying PC (lemma 1).

Proof of Theorem 2
Correctness of the Algorithm
The triplets present in the tree TPC returned by PhySICPC depend on the internal branches of this tree. Any internal branch of TPC is created at line 10 during some recursive step, thus linking a subtree Ti to the root of the tree returned by this step. Thanks to Lemma 1, we just have to prove that every branch created at this line does not generate a triplet that directly contradicts R.

At this line, either (i) Formula PC corresponds exactly to the connected components of G (we have |CC(G)| > 1) and then, from Proposition 3, the triplets created are not in direct contradiction with R; or (ii) Formula PC corresponds to a partition of the vertices of G based on R'. Two cases are then possible. In the first case (line 4), G' is connected and the current call returns a multifurcation that generates no triplet and hence does not invalidate PC. In the second case, the Repeat loop ensures that for each set of three taxa A,B,Cisin v(G), Formula PC does not contain two elements Ci and Cj with A,Bisin Ci and Cisin Cj such that either AC|B or BC|A belongs to R = R'{cup}Rdc. Indeed, AC|B (and BC|A) cannot be in Rdc since Ci would have been removed from Formula PC (line 9). Moreover, if AC|B (respectively BC|A) was in R', then A and C (respectively B and C) would have been in the same component of Formula PC contradicting Ci!= Cj. This proves that the tree TPC built in line 10, and whose subtrees bijectively correspond to the elements of Formula PC, is such that no triplet of rt(TPC) directly contradicts R.

Time Complexity of the Algorithm.—The most time-consuming operations in PhySICPC are the computation of R'|v(Ci) and G'i (line 8), and that of the connected components of this graph (line 9). Obtaining R'|v(Ci) and constructing G'i requires considering each triplet of R at most once and thus has a time complexity of O(n3). Determining the CC(G'i)s costs O(n2) (which is the maximum number of edges for a graph with n vertices). During the whole set of recursive calls to PhySICPC, Formula PC is modified at most O(n) times (proportional to the number of clades of a tree with n leaves). Lines 8 and 9 are executed as many times as Formula PC is modified; i.e., O(n) times. Thus, for the whole set of recursive calls to PhySICPC, the computation time required by these critical lines is O(n4), which is also the complexity of the entire procedure.

Proof of Theorem 2
Correctness of the Algorithm
we first prove that PhySICPI always returns a tree, denoted by TPI, and then that TPI satisfies both PC and PI. By hypothesis, PhySICPI is called with a tree TPC that satisfies PC. Because PhySICPI modifies this tree by only collapsing some of its branches (possibly none), the tree considered in any execution CheckPI never directly contradicts {tau}. This ensures that the PhySICPI subroutine never exits in line 12; i.e., PhySICPI always returns a tree. Moreover, being a contraction of TPC, this tree satisfies PC.

CheckPI differs from the Identifies algorithm in that Return "tree not identified" in the latter is replaced by line 14 in the former. Given a tree T and a set R of triplets, Identifies (T,R) returns yes if R identifies T (otherwise it returns no) (Daniel, 2004, theorem 3.1.1). This ensures that when a call to CheckPI(T,RPI) issued by PhySICPI in line 11 collapses no branch of TPI, then the set R identifies this tree. Since the tree TPI returned by PhySICPI is such that it is not modified by the last run of CheckPI, then TPI is identified by R(TPI,{tau}) = RPI. In other words, TPI satisfies both PI and PC for {tau} (Prop. 1).

Time Complexity of the Algorithm
As for PhySICPC, the most time-consuming operations done by PhySICPI are the construction of the Aho graph Gij and the computation of its connected components in the CheckPI subroutine. The Gi graphs that may be used in CheckPI can be precomputed in the PhySICPI part of the pseudo-code (i.e., before calling CheckPI), knowing rt({tau}) and the current tree TPI to be examined in CheckPI. This preprocess clearly requires O(n4) time, since there are O(n) such graphs (one for each clade of T), each of which is obtained by examining the O(n3) triplets of R(TPI,{tau}). Each Gij graph can be obtained from a copy of the corresponding Gi graph, completed by the edges due to triplets AB|C having A,Bisin Ci and Cisin Cj. All the Gij graphs required during the recursive calls to CheckPI resulting from an execution of line 11 in PhySICPI can also be precomputed in the PhySICPI pseudo-code part. This can be done just before line 11, provided that CheckPI is modified to end as soon as an edge is collapsed (line 14)—it is clear that this slight modification does not modify the correctness of the algorithm. Indeed, the only Gijs that are then required by CheckPI are those corresponding to two sibling clades Ci and Cj of the current TPI tree. Computing all of these Gijs before line 11 of PhySICPI is done in O(n3) since each triplet AB|C of rt(TPI) adds an edge between A and B in the one and only graph Gij, such that Ci and Cj are sibling clades in TPI and A,Bisin Ci and Cisin Cj.

Note also that the only information used by CheckPI on graph Gi and Gij is the number of their connected components. The total number of edges present in the Gij graphs is in O(n3): precomputation of the number of connected components for this set of graphs is thus globally O(n3) time. As this has to be done at each pass of the Repeat loop, and as this loop is done at most O(n) times (each pass results in the collapsing of one of the O(n) clades of T), this part of the computation is globally (on the whole for PhySICPI) in O(n4) time. Determination of the number of connected components of each Gi is done only once just before the Repeat loop. For each of these O(n) graphs, this requires examining O(n3) triplets. Thus, this preprocess also costs O(n4) time. The preprocesses done for Gi and Gij graphs thus require O(n4) time and reduce the running time of CheckPI. The modification of CheckPI, consisting of returning to PhySICPI as soon as an edge is collapsed, also simplifies the algorithm (e.g., the Repeat loop is no longer required).

Thanks to the preprocessing, the only time-consuming operation in CheckPI for the current tree TPI is the examination of the O(n2) pairs of sibling clades Ci and Cj of this tree. Operations performed for each of this pair of clades is in O(1) (the number of connected components of useful graphs G, Gi and Gij have been preprocessed). Because a new tree TPI can only be obtained by collapsing one of the O(n) edges (line 14) of TPI, this can at most occur O(n) times. Therefore, all executions of CheckPI issued by a run of PhySICPI are in O(n3) time. Thus, the whole complexity of the PhySICPI algorithm is no more than the cost of the preprocessing; i.e., O(n4) time.


    Acknowledgements
 
We thank J. Cotton, R. Page, O. Bininda-Emonds, and an anonymous reviewer for many invaluable remarks on a first version of this manuscript. This work has been supported by the "ACI Informatique-Mathématique-Physique en Biologie Moléculaire [ACI IMP-Bio]," by the "Action incitative BIOSTIC-LR," by IFR119 "Biodiversitée Continentale Méditerranéenne et Tropicale" (Montpellier), and by the Research Networks Program in BIOINFORMATICS of the High Council for Scientific and Technological Cooperation between France and Israel. This publication is the contribution no. 2007-053 of the Institut des Sciences de l'Évolution de Montpellier (UMR 5554–CNRS). We thank V. Lefort for creating the web site of PhySIC.


    References
 Top
 Abstract
 Non-Contradiction and Induction...
 PhySIC: A Polynomial-Time Veto...
 Biological Case Studies on...
 Conclusion
 Appendix 1
 References
 

    Adams E. Consensus techniques and the comparison of taxonomic trees. Syst. Zool. (1972) 21:390–397.[Abstract/Free Full Text]

    Aho A. V., Sagiv Y., Szymanski T. G., Ullman J. D. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comp. (1981) 10:405–421.[CrossRef]

    Bandelt H.-J., Dress A. W. M. Reconstructing the shape of a tree from observed dissimilarity data. Adv. Appl. Math. (1986) 7:309–343.[CrossRef]

    Baum B. R. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon (1992) 41:3–10.[CrossRef][Web of Science]

    Baum B. R., Ragan M. A. The MRP method. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O., ed. (2004) The Netherlands: Kluwer. 17–34.

    Berry V., Gascuel O. Inferring evolutionary trees with strong combinatorial evidence. Theor. Comput. Sci. (2000) 240:217–298.

    Berry V., Guillemot S., Nicolas F., Paul C. On the approximation of computing evolutionary trees. Wang L., ed. (2005) 3595. Proceedings of the 11th International Computing and Combinatorics Conference (COCOON1905). LNCS. Springer. 115–125.

    Berry V., Nicolas F. Maximum agreement and compatible supertrees. In: Proceedings of CPM—Sahinalp S. C., Muthukrishnan S., Dogrusoz U., eds. (2004) 205–219. volume 3109 of LNCS.

    Berry V., Semple C. Fast computation of supertrees for compatible phylogenies with nested taxa. Syst. Biol. (2006) 55:270–288.[Abstract/Free Full Text]

    Bininda-Emonds O. R. P. Novel versus unsupported clades: Assessing the qualitative support for clades in MRP supertrees. Syst. Biol. (2003) 52:839–848.[Abstract/Free Full Text]

    Bininda-Emonds O. R. P. The evolution of supertrees. Trends Ecol. Evol. (2004a) 19:315–322.[CrossRef][Medline]

    Bininda-Emonds O. R. P. Phylogenetic supertrees: Combining information to reveal the tree of life, volume 4 of Computational biology series (2004b) The Netherlands: Kluwer.

    Bininda-Emonds O. R. P., Bryant H. N. Properties of matrix representation with parsimony analyses. Syst. Biol. (1998) 47:497–508.[Web of Science][Medline]

    Bryant D. Building trees, hunting for trees and comparing trees: Theory and method in phylogenetic analysis (1997) UK: Department of Mathematics, University of Canterbury. PhD thesis.

    Bryant D., Berry V. A structured family of clustering and tree construction methods. Adv. Appl. Math. (2001) 27:705–732.[CrossRef]

    Bryant D., Semple C., Steel M. Supertree methods for ancestral divergence dates and other applications. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O., ed. (2004) The Netherlands: Kluwer. 129–150. volume 4 of Computational biology series.

    Chen D., Diao L., Eulenstein O., Fernández-Baca D., Sanderson M. J. Flipping: A supertree construction method. In: Bioconsensus, volume 61 of Series in discrete mathematics and theoretic computer science, DIMACS (2003) Providence, Rhode Island: American Mathematics Society. 135–160.

    Cotton J. A., Slater C. S. C., Wilkinson M. Discriminating supported and unsupported relationships in supertrees using triplets. Syst. Biol. (2006) 55:345–350.[Free Full Text]

    Criscuolo A., Berry V., Douzery E. J. P., Gascuel O. SDM: A fast distance-based approach for (super)tree building in phylogenomics. Syst. Biol. (2006) 55:740–755.[Abstract/Free Full Text]

    Daniel P. Supertree methods: Some new approaches (2004) UK: University of Canterbury. Master's thesis.

    Daniel P., Semple C. Supertree algorithms for nested taxa. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O., ed. (2004) The Netherlands: Kluwer. 151–171. volume 4 of Computational biology series.

    Daubin V., Gouy M., Perriére G. A phylogenomic approach to bacterial phylogeny: Evidence of a core of genes sharing a common history. Genome Res. (2002) 12:1080–1090.[Abstract/Free Full Text]

    Davies T. J., Barraclough T. G., Chase M. W., Soltis P. S., Soltis D. E., Savolainen V. Darwin's abominable mystery: Insights from a supertree of the angiosperms. Proc. Natl. Acad. Sci. USA (2004) 101:1904–1909.[Abstract/Free Full Text]

    Dekker M. C. Reconstruction methods for derivation trees (1986) The Netherlands: University of Amsterdam. Master's thesis.

    Delsuc F., Brinkmann H., Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. (2005) 6:361–375.[Web of Science][Medline]

    Dutheil J., Gaillard S., Bazin E., Glemin S., Ranwez V., Galtier N., Belkhir K. Bio++: A set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics (2006) 7:188.[CrossRef][Medline]

    Goloboff P. A. Minority-rule supertrees? MRP, compatibility, and MinFlip may display the least frequent groups. Cladistics (2005) 21:282–294.[CrossRef][Web of Science]

    Goloboff P. A., Pol D. Semi-strict supertrees. Cladistics (2002) 18:514–525.[Web of Science]

    Goodman M., Grossman L. I., Wildman D. E. Moving primate genomics beyond the chimpanzee genome. Trends Genet. (2005) 21:511–517.[CrossRef][Web of Science][Medline]

    Gordon A. G. Consensus supertrees: The synthesis of rooted trees containing overlapping sets of labelled leaves. J. Classif. (1986) 3:335–348.[CrossRef]

    Grunewald S., Steel M. A., Swenson M. S. Closure operations in phylogenetics. Math Biosci. (2007) 208:521–537.[CrossRef][Web of Science][Medline]

    Guillemot S., Berry V. Finding a largest subset of rooted triples identifying a tree is an NP-hard task (2007) Technical report, LIRMM, University of Montpellier 2.

    Guindon S., Gascuel O. A simple, fast and accurate method to estimate large phylogenies by maximum-likelihood. Syst. Biol. (2003) 52:696–704.[Abstract/Free Full Text]

    Huson D. H., Nettles S. M., Warnow T. J. Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J. Comput. Biol. (1999) 6:369–386.[CrossRef][Web of Science][Medline]

    Karanth K. P., Delefosse T., Rakotosamimanana B., Parsons T. J., Yoder A. D. Ancient DNA from giant extinct lemurs confirms single origin of Malagasy primates. Proc. Natl. Acad. Sci. USA (2005) 102:5090–5095.[Abstract/Free Full Text]

    Lapointe F.-J., Cucumel G. The average consensus procedure: Combination of weighted trees containing identical or overlapping sets of taxa. Syst. Biol. (1997) 46:306–312.[Abstract/Free Full Text]

    Lerat E., Daubin V., Moran N. A. From gene trees to organismal phylogeny in prokaryotes: The case of the gamma-proteobacteria. PLoS Biol. (2003) 1:101–109.[CrossRef][Web of Science]

    Margush T., McMorris F. Consensus n-trees. Bull. Math. Biol. (1981) 43:239–244.[Web of Science]

    Neumann D. A. Faithful consensus methods for n-trees. Math. Biosci. (1983) 63:271–287.[Web of Science]

    Page R. D. M. Modified mincut supertrees. Guigó R., Gusfield D., eds. (2002) Proceedings of the 2nd International Workshop on Algorithms in Bioinformatics (WABI1902). Rome, Italy: Springer. 537–552.

    Pisani D., Wilkinson M. Matrix representation with parsimony, taxonomic congruence, and total evidence. Syst. Biol. (2002) 51:151–155.[Free Full Text]

    Poux C., Chevret P., Huchon D., de Jong W. W., Douzery E. J. P. Arrival and diversification of caviomorph rodents and platyrrhine primates in South America. Syst. Biol. (2006) 55:228–244.[Abstract/Free Full Text]

    Poux C., Douzery E. J. P. Primate phylogeny, evolutionary rate variations, and divergence times: A contribution from the nuclear gene IRBP. Am. J. Phys. Anthropol. (2004) 124:1–16.[CrossRef][Web of Science][Medline]

    Purvis A. A composite estimate of primate phylogeny. Philos. Trans. R. Soc. Lond. B Biol. Sci. (1995a) 348:405–421.[Abstract/Free Full Text]

    Purvis A. A modification to Baum and Ragan's method for combining phylogenetic trees. Syst. Biol. (1995b) 44:251–255.[Free Full Text]

    Ragan M. A. Matrix representation in reconstructing phylogenetic relationships among the eukaryots. Biosystems (1992) 28:47–55.[CrossRef][Web of Science][Medline]

    Roos C., Schmitz J., Zischler H. Primate jumping genes elucidate strepsirrhine phylogeny. Proc. Natl. Acad. Sci. USA (2004) 101:10650–10654.[Abstract/Free Full Text]

    Ross H. A., Rodrigo A. G. An assessment of matrix representation with compatibility in supertree construction. In: Phylogenetic supertrees: Combining information to reveal the tree of life—Bininda-Emonds O. R. P., ed. (2004) The Netherlands: Kluwer. volume 4 of Computational biology series.

    Sanderson M. J., Purvis A., Henze C. Phylogenetic supertrees: Assembling the trees of life. Trends Ecol. Evol. (1998) 13:105–109.[CrossRef]

    Semple C., Steel M. A supertree method for rooted trees. Discrete Appl. Math. (2000) 105:147–158.[CrossRef]

    Semple C., Steel M. A. Phylogenetics, volume 24 of Oxford lecture series in mathematics and its applications (2003) Oxford, UK: Oxford University Press.

    Shimodaira H., Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. (1999) 16:1114–1116.[Web of Science]

    Singer S. S., Schmitz J., Schwiegk C., Zischler H. Molecular cladistic markers in New World monkey phylogeny (Platyrrhini, Primates). Mol. Phylogenet. Evol. (2003) 26:490–501.[CrossRef][Web of Science][Medline]

    Steel M. A. The complexity of reconstructing trees from qualitative characters and subtree. J. Classif. (1992) 9:91–116.[CrossRef]

    Steel M. A., Dress A. W. M., Böcker S. Simple but fundamental limitations on supertree and consensus tree methods. Syst. Biol. (2000) 49:363–368.[Free Full Text]

    Swofford D. L. PAUP*: Phylogenetic analysis using parsimony (*and other methods). Version 4b10 (2002) Sunderland, Massachusetts: Sinauer Associates.

    Thompson J. D., Gibson T. J., Plewniak F., Jeanmougin F., Higgins D. G. The clustalX windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. (1997) 24:4876–4882.

    Thorley J. L., Wilkinson M. A view of supertrees methods. In: Bioconsensus—Janowitz M. F., Lapointe F.-J., McMorris F. R., Roberts F. S., eds. (2003) Providence, Rhode, Island: DIMACS American Mathematics Society. 185–194. volume 61 of Discrete mathematics and theoretical computer science.

    Wilkinson M. Common cladistic information and its consensus representation: Reduced Adams and reduced cladistic consensus trees and profiles. Syst. Biol. (1994) 43:343–368.[Abstract/Free Full Text]

    Wilkinson M., Pisani D., Cotton J. A., Corfe I. Measuring support and finding unsupported relationships in supertrees. Syst. Biol. (2005) 54:823–831.[Free Full Text]

    Wilkinson M., Thorley J., Pisani D., Lapointe F.-J., McInerney O. Some desiderata for liberal supertree. In: Phylogenetic supertrees: Combining information to reveal the tree of life—Bininda-Emonds O. R. P., ed. (2004) The Netherlands: Kluwer. 227–246. volume 4 of Computational biology series.

    Wilson D. E., Reeder D. M. Mammal species of the world (2005) Baltimore, Maryland: Johns Hopkins University Press.

    Xing J., Wang H., Han K., Ray D. A., Huang C. H., Chemnick L. G., Stewart C.-B., Disotell T. R., Ryder O. A., Batzer M. A. A mobile element based phylogeny of Old World monkeys. Mol. Phylogenet. Evol. (2005) 37:872–880.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
J. H. Degnan, M. DeGiorgio, D. Bryant, and N. A. Rosenberg
Properties of Consensus Methods for Inferring Species Trees from Gene Trees
Syst Biol, June 4, 2009; (2009) syp008v1.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ranwez, V.
Right arrow Articles by Douzery, E. J. P.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Ranwez, V.
Right arrow Articles by Douzery, E. J. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?