© 2006 Society of Systematic Biologists
Fast Computation of Supertrees for Compatible Phylogenies with Nested Taxa
Edited by Olaf Emonds: Associate Editor
1 Département Informatique, L.I.R.M.M.–C.N.R.S. 161 rue Ada, 34392 Montpellier Cedex 5, France E-mail: vberry{at}lirmm.fr
2 Biomathematics Research Centre, Department of Mathematics and Statistics, University of Canterbury Christchurch, New Zealand E-mail: c.semple{at}math.canterbury.ac.nz
| Abstract |
|---|
|
|
|---|
Typically, supertree methods combine a collection of source trees in which just the leaves are labeled by taxa. In such methods the resulting supertree is also leaf labeled. An underlying assumption in these methods is that across all trees in the collection, no two of the taxa are nested; for example, "buttercups" and "plants" are nested taxa. Motivated by Page, the first supertree algorithm for allowing the source trees to collectively have nested taxa is called ANCESTRAL BUILD. Here, in addition to taxa labeling the leaves, the source trees may have taxa labeling some of their interior nodes. Taxa-labeling interior nodes are at a higher taxonomic level than that of their descendants (for example, genera versus species). Analogous to the supertree method BUILD for deciding the compatibility of a collection of source trees in which just the leaves are labeled, ANCESTRAL BUILD is a polynomial-time algorithm for deciding the compatibility of a collection of source trees in which some of the interior nodes are also labeled by taxa. Although a more general method, in this paper we show that the original description of ANCESTRAL BUILD can be modified so that the running time is as fast as the current fastest running time for Build. Fast computation for deciding compatibility is essential if one is to make use of phylogenetic databases that contain thousands of trees on tens of thousands of taxa. This is particularly so as ANCESTRAL BUILD is incorporated as a basic tool inside more general supertree methods (that is, methods that always output a tree regardless of the compatibility of the source trees). We apply the method to propose a comprehensive phylogeny of the strepsirrhines, a major group of the primates.
Keywords: Compatibility; nested taxa; phylogenetics, strepsirrhine phylogeny; supertree methods
Received July 4, 2005; Revised September 9, 2005; Accepted November 4, 2005
Supertree methods are a fundamental and practical way of inferring phylogenies. Generally speaking, these methods amalgamate a collection of "source" trees on overlapping subsets of taxa into a single parent tree that contains the taxa of all of the source trees. This parent tree is called a supertree. This approach to constructing evolutionary trees is particularly appealing because it allows the inference of an evolutionary scenario from a combination of analyses differing in the set of taxa they encompass as well as in the primary data from which they were conducted (for example, molecular or morphological studies). The increasing popularity of these methods and the diversity of ways in which they can be used is highlighted in a recently published book (Bininda-Emonds, 2004).
Historically, supertree methods only make use of the leaf labels of the source trees, with the resulting supertree having the property that taxa are only attached at leaves. On this basis, supertree methods can deal with taxa at different taxonomic levels only in the case where the leaves of the source trees represent non-nested taxa (e.g., "elephant" and "mammal" should not be represented as two distinct leaves amongst the source trees as the former taxa is nested inside the latter taxa). However, in supertree studies this situation does occur, especially concerning outgroups. For example, a source tree containing strepsirrhine species could include Haplorrhini, a suborder, as an outgroup leaf, and another source tree on primates would include several haplorrhine species as leaves, whereas the latter are nested inside the former taxon. Because of the assumption that taxa of the source trees are non-nested, traditional supertree methods produce in that case an incorrect supertree. Thus, unless one resorts to preprocessing of source trees to remove nested taxa via taxonomic synonymy (Bininda-Emonds et al. 2004), traditional supertree methods have limited use in the case of nested taxa, as arises when one is to amalgamate trees from phylogenetic databases such as TreeBASE (Sanderson et al., 1993). This was recently highlighted by Page (2004). Indeed, TreeBASE incorporates trees from many different published phylogenetic studies that focus on a variety of different biological problems, and hence consider evolutionary relationships at different taxonomic levels.
As a consequence of this limitation, Page (2004) motivated the task of designing supertree methods that allow the interior nodes as well as all of the leaves to represent taxa in the source trees. In these nested-taxa source trees, an interior label corresponds to a taxon at a higher taxonomic level than any of its descendants. Note that interior labels can be either available initially in the source trees, or added to some source trees based on taxa present in other source trees at the time the collection of studied source trees is formed. The first computational problem that arises is to find a polynomial-time algorithm for deciding whether or not a collection of nested-taxa source trees is compatible and, if so, constructing an appropriate supertree.
In answer to this problem, Daniel and Semple (2004) provided an algorithm called AncestralBuild. This algorithm generalizes Build, one of the first supertree methods (Aho et al., 1981). The polynomial-time algorithm Build takes a collection
of rooted leaf-labeled trees as input and decides whether or not
is compatible, in which case it returns a leaf-labeled supertree that displays
. That is, the supertree preserves all of the relative groupings of taxa present in the source trees. For a collection of nested-taxa trees, if a supertree preserves all ancestral relationships as well as all groupings of taxa, then the supertree is said to ancestrally display this collection and the collection is said to be ancestrally compatible. These concepts are formally defined in the next section. However, we make a comment here on the way in which ancestor is used in this paper: by writing that a taxon tis an ancestor of a taxon t', we mean that t is either a hypothetical ancestral taxon of t' or that t is the name of a taxonomic grouping containing t', so that node labeled t is ancestral to the node labeled t'. The algorithm AncestralBuild takes a collection
of nested-taxa trees as input and outputs a supertree that ancestrally displays
if such a supertree exists; otherwise, it states that the collection is not ancestrally compatible. Though designed to handle trees containing taxa at both internal nodes and leaves, AncestralBuild also accepts collections of source trees that have taxa only at the leaves, because leaf-labeled trees are a special case of nested-taxa trees. In that particular case, AncestralBuild decides the compatibility of the source trees in the usual sense. Consequently, it does indeed generalize Build.
AncestralBuild has the desirable property to give an exact answer in polynomial time Daniel and Semple, 2004). However, there can be three objections to its use. First, for incompatible phylogenies to be combined, an all-or-nothing algorithm, that is, just stating the incompatibility when it arises, is not desirable. Second, even for the easiest case of source trees that are all fully resolved and have taxa only at the leaves, the running time of the version of AncestralBuild stated in Daniel and Semple (2004) is O(t2n3), where t is the number of source trees and n is the number of taxa. Despite being polynomial, this running time makes AncestralBuild a reasonably slow algorithm in practice, particularly if it is to handle the thousands of trees stored in databases such as TreeBASE. Third, in the absence of internal labels in the source trees, the method will build a nonsensical supertree, not knowing for instance that the "mammal" taxon met in one tree is an ancestor of "elephant," present in another tree. We answer these three objections below.
Inferring a Supertree for Incompatible Source Trees
Primary data sequences are getting longer and phylogenetic methods more accurate, so that a reasonable number of published phylogenies are compatible. For example, Llabrés et al. (2005) found a very low rate of incompatible trees in TreeBASE. However, it is still relatively frequent that one has to deal with incompatible source trees. For example, it is well known that gene trees may conflict with species trees (due to, e.g., lateral gene transfer or hybridization), and that some trees may contain erroneous branches inferred due to noisy information for some internal edges in the primary data. In the former case, one may want to construct a supertree to identify the disagreements between the gene trees and the species trees. In such cases, an all-or-nothing algorithm may not be enough and one can understandably prefer a more general supertree method that outputs a supertree that either conflicts with or omits some information present in the source trees. However, for any general supertree method, a basic property that one would always like is that of consistency; that is, if the source trees carry no conflicting information, then the supertree returned by the method displays each of the source trees. Because the property of consistency is such a compelling property, many general supertree methods dealing with leaf-labeled trees (respectively, nested-taxa trees) are likely either to have Build (respectively, AncestralBuild) as a subroutine or to be a variant of Build (respectively, AncestralBuild). Indeed, this is already the case for some general methods: both MinCutSupertree (Semple and Steel 2000) method and its modified version (Page, 2002) are variants of Build and, more recently, Daniel and Semple (2005) describe a class of general supertree methods for nested-taxa source trees that is a variant of AncestralBuild. (This class and, more particularly, the underlying general supertree method NestedSupertree are described further in the last section.) Moreover, these all-or-nothing algorithms can be repeatedly used in simple schemes to extract compatible parts out of a collection of incompatible source trees. We highlight two examples of such schemes in the discussion part of this paper.
Reasonable Running Time
Given the amount of information in current tree databases, it is not unreasonable to try to amalgamate hundreds of trees that collectively contain thousands of taxa and, consequently, fast algorithms are essential. To deal with fully resolved (i.e., binary) leaf-labeled trees, Henzinger et al. (1999) proposed a fast implementation of Build that runs in O(mn
) time, where m is the total sum of the number of nodes in each of the source trees and n is the total number of taxa. (They also proposed a variant of Build running in O(m+n2log n) time, but, in the case of compatibility, it typically resulted in a supertree with edges not supported by any part of the data, an undesirable feature for a supertree algorithm.) For comparison between this running time and the O(t2n3) running time of AncestralBuild, note that m can range from O(n) to O(tn). Thus, the running time of AncestralBuild as stated in Daniel and Semple (2004) is still relatively slow compared to the implementation provided by Henzinger et al. (1999) for BUILD in the special case of fully resolved leaf-labeled trees. However, comparing the gap between these running times provides hope for improving AncestralBuild on this aspect, despite the latter being a much more general algorithm, for it allows nested-taxa trees and trees that are partially resolved. In this paper we describe several modifications to AncestralBuild that greatly reduce its computational complexity. Some of these modifications have a similar flavor to techniques used in (Henzinger et al. 1999). The running time of AncestralBuild resulting from these modifications is as fast as the current fastest algorithm for source trees in which only the leaves are labeled Henzinger et al. 1999). More particularly, the achieved running time is almost linear in the size of the source trees, when these trees are fully resolved. Furthermore, extending the implementation of Henzinger et al. (1999) for Build in the most canonical and efficient way to allow partially resolved leaf-labeled trees in its input, we also show that the running time of our modification of AncestralBuild is the same as that for this extension of Build.
Having a Sensible Supertree as Output
In the case where nested taxa are present in different source trees but these trees contain no internal label, then, unless extra information is provided to recognize such nestings, there is no chance that AncestralBuild (or any supertree method) produces a sensible supertree. For example, if one tree contains "elephant" and another "mammal" without any tree indicating that the latter is an ancestor of the former, then the supertree output by the methods will have both taxa as tips. This situation arises because of lack of information in the source trees. Unfortunately, this will be a frequent situation in practice because already inferred trees are usually stored without internal labels, though this is allowed by the Newick format. In such a case, all that is required for the method to produce a sensible supertree is a (partial) reference taxonomy indicating the nestings between taxa of interest. This taxonomy can be included as one extra source tree or be divided into several extra very simple trees of two vertices, where an ancestor taxon is above one of its descendant taxon. This reference taxonomy can be either built by hand, or generated automatically by a program searching taxon names in Web-accessible taxonomies, like NCBI's Taxonomy Browser (http://www.ncbi.nlm.nih.gov/entrez) or the Tree of Life website (http://tolweb.org/tree). See Page and Valiente (2005) and Hibbett et al. (2005) for programs filling aims close to that one.
To describe an application of AncestralBuild, we studied the evolutionary relationships of the Strepsirrhini, one of the two major primate groups. Monophyly of this group is widely accepted but some intrasubordinal relationships were still recently under debate. Recent studies investigating strepsirrhine phylogeny involve quite different kinds of data, including mtDNA sequences (Yoder et al., 2000), craniodental morphological data (Masters and Brothers, 2002), and retroposon analysis (Roos et al., 2004). We applied AncestralBuild to four phylogenies published in these papers, covering many different taxonomic levels from order to individuals. The method shows that these phylogenies are ancestrally compatible and provides as a result a comprehensive supertree of the Strepsirrhini. This highlights, once again, the interest in the supertree approach, which allows us to combine phylogenies inferred from different kinds of primary data.
The rest of the paper is organized as follows. The next section contains some preliminaries that are used throughout the paper. The following three sections contain a description of ANCESTRALBUILD as stated in Daniel and Semple (2004) and descriptions of our modifications of this algorithm as well as the running time of this modification. An application of AncestralBuild to obtain a strepsirrhine supertree is described in the following section. In addition to deciding compatibility, AncestralBuild can be used in a variety of ways for constructing supertrees from incompatible source trees. We highlight some of these ways in the last section. Finally, the appendix contains the proof that establishes that our modification of AncestralBuild does indeed decide ancestral compatibility for a collection of nested-taxa trees as well as a pseudocode description of this algorithm.
We end the introduction by noting that a novel approach for testing the ancestral compatibility of two rooted semilabeled trees has been recently proposed by Llabrés et al. (2005).
| Preliminaries |
|---|
|
|
|---|
In this section, we describe some concepts that are frequently used in the paper. For further details, we refer the interested reader to Semple and Steel (2003).
Phylogenies
The degree of a node v in a graph (or, in particular, a tree) is the number of edges incident with v. We denote the degree of v by d(v). Essentially, a rooted phylogenetic X-tree is a rooted tree whose leaves are labeled with the elements of a set X of taxa. We formally define it as follows. A rooted phylogenetic tree (on X) is an ordered pair (T;
) consisting of a rooted tree T in which all interior nodes have degree at least 3, except the root, which has degree at least 2, and a map
from X to the leaf set of T that assigns each element in X to a leaf in T so that no two elements are assigned the same leaf in T and each leaf of T is assigned an element of X. A rooted phylogenetic tree is fully resolved or binary if the root of T has degree 2 and all other interior nodes of T have degree 3.
The concept of rooted phylogenetic trees naturally extends to trees in which some of the interior nodes are also labeled by taxa. Such trees, called nested-taxa trees or rooted semilabeled trees, are used to display in the same tree taxa at different taxonomic levels. More precisely, a rooted semilabeled tree
on a taxa set X is an ordered pair (T;
) consisting of a rooted tree T with root node
, and a map
from X into the node set V of T such that for all nonroot nodes v of degree at most 2,
assigns v an element of X, and if
has degree 0 or 1, then
also assigns
an element of X. If every node of T is assigned at most one taxa of X under
, then
is said to be singularly labeled. Furthermore, if
is singularly labeled and the degree of any node in T is at most 3, except for the root, which has degree at most 2, then we say that
is binary. In Figure 1, each of
1,
2, and
3 is a rooted semilabeled tree, but
3 is the only rooted semilabeled tree that is binary.
|
Remarks
- We will often write a rooted semilabeled X-tree for a rooted semilabeled tree on X.
- Observe that rooted phylogenetic trees are special types of rooted semilabeled trees.
- To simplify matters and because we see no practical reason for nodes in the source trees to be assigned more than one taxa of X, we will assume throughout the paper that all rooted semilabeled trees that are source trees are singularly labeled. However, we note that the upgrade of the results in this paper to nonsingular rooted semilabeled trees is straightforward. Note that this remark about singular labeling does not apply to the output tree, where it is quite possible that some nodes are labeled with more than one taxa. For instance, a node joining human and chimp on a source tree that contains no other mammals could be equally labeled as anything from "hominoid" to "primate" to "mammal." This multiple listing of a node then becomes very important if all these names appear on other source trees and there is no user-input taxonomy to sort this out.
Let
= (T;
) be a rooted semilabeled tree on X. The set X is called the label set of
and we call the elements of Xlabels. We also use
(
) to denote the label set of
. For example, in Figure 1,
(
1) = {c,d,e,f,x,y}. For a node v of T, we denote the set of elements of X that are assigned to v by
–1(v) and say that the elements in
–1(v) labelv. Furthermore,
is fully labeled if every node of T is labeled by an element of X. For a collection
of rooted semilabeled trees, we denote the union of the label sets of the trees in
by
(
). Thus, for example, for the collection
of rooted semilabeled trees shown in Figure 1, we have
(
) = {a,b,c,d,e,f,x,y,z}.
Let
= (T;
) be a rooted semilabeled tree, and let a, b
(
). We say that a is a descendant (label) of b (or, alternatively, b is an ancestor [label] of a) if the path from the root of T to
(a) includes
(b). Furthermore, a and b are not comparable if neither a is a descendant of b nor b is a descendant of a. Now suppose that a and b are not comparable in
. Then the node of T that is the last common node on the paths from the root of T to
(a) and from the root of T to
(b) is called the most recent common ancestor of a and b, and is denoted mrca
(a,b).
Lastly, throughout the paper, for a rooted semilabeled tree
, we will use |
| to denote the number of nodes in
. Furthermore, for a collection
of rooted semilabeled trees, we will use m to denote 
![]()
|
|.
Compatibility
We will describe the notion of compatibility for rooted phylogenetic trees first, before extending this notion to rooted semilabeled trees.
Let
be a rooted phylogenetic tree on X and let
' be a rooted phylogenetic tree on X', where X is a subset of X'. We say that
' displays
if, up to suppressing all nonroot nodes of degree 2, the minimal rooted subtree of
' that connects the elements in X is a refinement of
, that is,
can be obtained from it by contracting edges. Suppressing a degree 2 node v means replacing v and its two incident edges with a single edge. Observe that the use of "refinement" means that we allow for the resolution of soft polytomies in the definition of displays, where we recall that a soft polytomy represents an uncertainty as to the exact order of speciation as opposed to a certainty of a multiple speciation event. A collection
of rooted phylogenetic trees are compatible if there is a rooted phylogenetic tree
that simultaneously displays each of the trees in
, in which case we say that
displays
.
The notion of displays for rooted semilabeled trees extends the notion of displays for rooted phylogenetic trees. In particular, let
be a rooted semilabeled X-tree and let
' be a rooted semilabeled X'-tree, where X is a subset of X'. Then
' ancestrally displays
if, up to suppressing nonroot nodes of degree 2, the minimal rooted subtree of
' that connects the elements in X is a refinement of
and, for all a,b
X, whenever a is a descendant label of b in
, we have that a is a descendant label of b in this rooted subtree. A collection
of rooted semilabeled trees is ancestrally compatible if there is a rooted semilabeled tree
that ancestrally displays each of the trees in
, in which case we say that
ancestrally displays
. To illustrate this notion, the collection of trees in Figure 1 is ancestrally compatible, as each of its trees is ancestrally displayed by the rooted semilabeled tree shown in Figure 2.
|
It is important to note that a collection of rooted semi-labeled trees may be ancestrally incompatible because of their interior labels and not their leaf labels. This could be as simple as a contradiction on an ancestor-descendant relationship, or as complex as the incompatibility of a set of most recent common ancestor relationships.
Graphs and Digraphs
Let G be a graph with node set V. We frequently use the notation {u,v} to denote the edge joining the nodes u and v. Furthermore, the connected components of G are the subgraphs of G such that, for all u,v
V, u and v are in the same connected component if and only if there is a path from u to v.
A directed graph, also called a digraph, is simply a graph in which, instead of having edges joining two nodes, we have directed edges; that is, edges directed from one node to another. Directed edges are also called arcs. We use the notation (u,v) to denote the arc directed from u to v. For a directed graph D, the indegree of a node v is the number of arcs directed into v and the outdegree of v is the number of arcs directed out of v. Analogous to the connected components of an ordinary graph, the (connected) arc components of D are the subdigraphs of D such that, for all u,v
V, u and v are in the same arc component if and only if, ignoring the direction of the arcs, there is a path between u and v.
For the purposes of this paper, we say that a graph is mixed if it contains both arcs and edges. An arc component of a mixed graph is an arc component of the digraph obtained when masking the edges of this graph.
Let D be a (mixed) graph, let v be a node of D, and let U be a subset of the set of nodes of D. The (mixed) graph obtained from D by deleting v and each of its incident arcs and edges is denoted by Dv. In general, we use DU to denote the graph obtained from D by deleting each of the nodes in U. Furthermore, if V is the node set of D, then the restriction of D to U (also called the subgraph of D induced by U) is the subgraph of D that is obtained by deleting each of the nodes in V–U. This subgraph is denoted by D|U.
Finally, one typically views a rooted tree as an undirected graph. However, it will often be convenient in this paper to view a rooted tree as a directed graph where each edge is replaced with an arc directed away from the root.
| Ancestralbuild |
|---|
|
|
|---|
For completeness and to make a comparison of the modifications we describe in this paper, in this section we give a full description of AncestralBuild as it is stated in Daniel and Semple (2004).
We begin by describing a construction on a collection of rooted semilabeled trees, and a particular mixed graph on a collection of rooted fully labeled trees. Both the construction and mixed graph are central to AncestralBuild.
Let
= (T;
) be a rooted semilabeled tree on X, where T has node set V. We say that a rooted fully labeled tree
' has been obtained from
by adding distinct new labels if, for each node of T that is not assigned a label under
, we assign it an arbitrary label not in X so that no two new labels are the same. In general, if
is a collection of rooted semilabeled trees, we say that
' has been obtained from
by adding distinct new labels if it has been obtained by adding distinct new labels to each tree in
so that across all trees in
' no two new labels are the same.
Example 1. To illustrate the above construction, let
be the collection of rooted semilabeled trees shown in Figure 1. The collection
' shown in Figure 3 has been obtained from
by adding distinct new labels. Note that the added labels only need to be unique, and not as involved as the ones shown in Figure 3. The reason for choosing these particular labels will be made clear in the next section.
|
Let
|
|
|
|
|
We now describe AncestralBuild. All of the work in the algorithm is performed by a subroutine called Descendant, which decides the ancestral compatibility of a collection
- Algorithm: ANCESTRALBUILD (
)
- Input: A collection
of rooted semilabeled trees on X.
- Output: A rooted semilabeled tree
that ancestrally displays
or the statement
is not ancestrally compatible.- Construct a collection
' of rooted fully labeled trees from
by adding distinct new labels to the unlabeled nodes in the trees of the collection.
- Construct the descendancy graph D(
') of
'.
- Call the subroutine Descendant (D(
')).
- If DESCENDANT returns no possible labeling, then return
is not ancestrally compatible. Otherwise, return the semilabeled tree
' returned by DESCENDANT with the added labels removed.
- Input: A collection
- Algorithm: Descendant (D(
'))
- Input: The descendancy graph of a collection
' of rooted fully labeled trees.
- Output: A rooted fully labeled tree
' with root node v' that ancestrally displays
' or the statement no possible labeling.- Let S0 denote the set of nodes of D(
') that have indegree zero and no incident edges. If S0 is empty, then halt and return no possible labeling.
- If S0 comprises of exactly one node labeled
with outdegree 0, then return the tree composed of just one leaf labeled
.
- Otherwise,
- Delete the elements of S0 (and their incident arcs) from D(
') and denote the resulting graph by D(
')\
0.
- Find the node sets S1,
2, ...,
k of the arc components of D(
')\
0.
- Delete all edges of D(
')\
0 whose end nodes are in distinct arc components of this graph.
- Delete the elements of S0 (and their incident arcs) from D(
- For each element i
{1,2,...,k}, call DESCENDANT (D(
')|
i). If any of these calls return no possible labeling, then return this message. Otherwise, return the tree whose root node is labeled by S0 and which has
1',...,
k' (the trees returned by the recursive calls) as child subtrees.
- Input: The descendancy graph of a collection
Remark
- With respect to descendancy, the added labels act as necessary "place holders" for unlabeled nodes.
- The recursive calls performed at Step 4 in DESCENDANT consider disjoint node induced subgraphs so that the processes applied to these subgraphs in subsequent iterations are independent from one subgraph to another.
Example 2. As an example of AncestralBuild applied to a collection of rooted semilabeled trees, consider the collection
of trees shown in Figure 1. Suppose that Step 1 constructs the collection
' of rooted fully labeled trees shown in Figure 3. Now Step 2 builds the descendancy graph D(
') as shown in Figure 4. On the first iteration of DESCENDANT, Step 1 finds mrca
1(x,y) and mrca
2(e,z) as the only nodes of D(
') that have indegree zero and no incident edges. Deleting these elements in Step 3 results in the creation of two arc components, one containing nodes a, b, and x and the other containing the nine remaining nodes. Recursive calls to DESCENDANT investigate these two components separately. At Step 4 of the initial call to DESCENDANT, the subtrees returned by the two recursive calls are used as child subtrees of the root of the tree returned there. The root of this tree, which is shown in Figure 5, is labeled by S0 = {mrca
1(x,y), mrca
2(e,z)}. All added labels are eventually removed in Step 4 of AncestralBuild. The final tree returned by AncestralBuild applied to
is shown in Figure 2.
|
The running time of AncestralBuild as it is stated above (and thus in Daniel and Semple [2004]) is given in Proposition 3.
Proposition 3. Let
be a collection of rooted semilabeled trees with |
| = t and |
(
)| = n. Then ANCESTRALBUILD (
) runs in time O(t2n3).
Proof
Recall that m = 
![]()

|
|. First note that the descendancy graph D(
) contains O(m) nodes and O(tn2) arcs and edges. In the worst case, every execution of DESCENDANT removes only one node in D(
), in which case the subroutine is executed O(m) times. The computation time is dominated by the cost of Steps 3(b) and 3(c) in the subroutine. Finding the connected arc components of a digraph is linear in the number of its nodes and arcs. Thus, assuming that in the worst case only a constant number of edges are removed with each node, an execution of Step 3(b) can require up to O(tn2) time to process the restriction of D(
) it is considering. Because of the m executions of the subroutine, this leads to an overall running time of O(mtn2) for Step 3(b). Finding edges across different arc components in Step 3(c) can necessitate at worst to examine the O(tn2) edges of the graph. This leads to an overall running time of O(mtn2) for Step 3(c). Noting that m = O(tn) gives the final result.
Remark
It is worth noting that the running time of AncestralBuild does not improve when each of the trees in
are fully resolved and phylogenetic. Indeed, the descendancy graph still has O(tn2) arcs and edges in such cases.
Daniel and Semple (2004) were only interested in making sure that AncestralBuild runs in time polynomial in the size of the input. Indeed, other than providing a brief check to note that it is polynomial, no consideration to the actual running time is given. Consequently, some simple and not-so simple improvements to the algorithm were overlooked. In the next section we show how this running time can be reduced to almost linear running time. This improvement results from three changes in the algorithm: (i) drastically reducing the number of arcs and edges in the descendancy graph; (ii) using an ad hoc graph of smaller size to compute the arc components of the various restrictions of the descendancy graph and identify edges between them; (iii) working these graph restrictions as actual subgraphs of the initial graph (and not as copies of bits of it), while maintaining node-connectivity through an efficient dynamic data structure. This data structure facilitates the discovery of new arc components resulting from edge deletions.
| Improving the Running Time of Ancestralbuild |
|---|
|
|
|---|
In this section we describe our modification of AncestralBuild.
Reducing the Size of the Descendancy Graph
We first show that a large amount of information included in the descendancy graph is redundant in the sense that the correctness of the algorithm is maintained when using a restricted version of this graph. For a collection
' of rooted fully labeled trees, let D*(
') be the graph having the same node set as D(
'), but whose arc and edge sets are
|
|
|
|
|
Proposition 4. Let
- a rooted semi labeled tree that ancestrally displays
if
is ancestrally compatible, or
- the statement
is not ancestrally compatible otherwise.
The statement of Proposition 4 is the same statement as Theorem 4.1 (Daniel and Semple, 2004), but without the proviso on using the restricted descendancy graph. Thus, not surprisingly, the proof of Proposition 4 is very similar to their proof. Consequently, to avoid repetition, we refer to parts of the latter proof where appropriate.
Proof of Proposition 4
We first note that, as in the proof of Theorem 4.1 (Daniel and Semple, 2004), it suffices to show that the result holds when
is a collection of rooted fully labeled trees. The proof of (i) is the same as the proof of Theorem 4.1 (Daniel and Semple, 2004).
For the proof of (ii), suppose that AncestralBuild (using the restricted descendancy graph) outputs a rooted semilabeled tree
'. We show that
' ancestrally displays
. Let
1 be an element of
. By Lemma 2.1 (Bordewich et al., 2005), it suffices to show, for all a,b
(
1) that (I) if a is a descendant label of b in
1, then a is a descendant label of b in
', and (II) if a and b are non-comparable in
1, then a and b are noncomparable in
'.
The argument for (I) is very similar to the corresponding argument in the proof of Theorem 4.1(ii) (Daniel and Semple, 2004), and so we omit it. To show (II), suppose that a and b are not comparable in
1. Assume that
1 = (T1;
1). Let v be the node in T1 that is the most recent common ancestor of
1(a) and
1(b). By the construction of D*(
), there is a pair of children, c and d say, of the label labeling v in T1 such that c and d are joined by an edge, and c is an ancestor label of a, and d is an ancestor label of b. Because we eventually output a tree, this edge is eventually deleted, but not until c and d, and hence a and b are in separate arc components of some restriction of D*(
). It now follows that a and b are not comparable in
'.
Remark
Let
be a collection of rooted semilabeled trees with |
| = t and |
(
)| = n, and let
' be a collection of fully labeled trees that is obtained from
by adding distinct new labels. Let m = 
![]()

|
|. Then the mixed graph D*(
') contains O(m) nodes and arcs. However, the number e of edges in D*(
') is a function of the degree of the nodes in the source trees. In particular, D*(
') contains
|
|
| Computing the Arc Components Via a Smaller Graph |
|---|
|
|
|---|
Despite the obvious improvements given by Proposition 4, it is Step 3(b) (finding the arc components) and, to a lesser extent, Step 3(c) (finding the edges joining distinct arc components) that have the biggest influence on the running time of AncestralBuild because these steps are performed a high number of times on relatively large graphs. To speed up these parts of the algorithm, we introduce an additional graph (which we call the "component graph"). The reason for this graph is that identifying the arc components of the restricted descendancy graph can be reduced to identifying the connected components in the component graph. The component graph is typically smaller than the restricted descendancy graph, and it is this smallness that provides the improvement in the running time of the algorithm. To describe the component graph, we first need to some additional concepts.
Let
= (T;
) be a rooted semilabeled tree on X, where T has node set V. We say that
' is obtained from
by adding most recent common ancestor labels if, for each node z of degree at least three in which
–1(z) is empty, we assign the label mrca
'(a,b) to z, where a,b
(
) and z is the most recent common ancestor of
(a) and
(b). By choosing leaf labels if necessary, we can always find appropriate choices for a and b. Because each of the newly added labels are distinct, this construction is a special case of adding distinct new labels. In the paper, we often view the label mrca
'(a,b) as the set {a,b} and freely move between the two viewpoints. Note that the choice of a and b need not be unique, however, which choice is made is irrelevant. In general, if
is a collection of rooted semilabeled trees, we say that
' has been obtained from
by adding most recent common ancestor labels if it has been obtained by adding most recent common ancestor labels to each tree in
. Note that the newly added labels across
' are distinct as every most recent common ancestor label refers to a particular tree in
.
Example 5. To illustrate the last construction, the collection
' of rooted semilabeled trees shown in Figure 3 has been obtained from the collection
of rooted semilabeled trees shown in Figure 1 by adding most recent common ancestors labels. For instance, the label mrca
1(x,y) is assigned to the node of
1 that is the most recent common ancestor of nodes labeled x and y.
Let
be a collection of rooted semilabeled trees on X, and let
' be a collection of rooted fully labeled trees obtained from
by adding most recent common ancestor labels. To describe the component graph of
', we simultaneously consider the restricted descendancy graph of
'. For a tree T and a node z in T, we say that u and v are siblings if both u and v are distinct children of z. Let
= (T;
) be an element of
', and let u and v be siblings of a node z in T, and consider
–1(u) and
–1(v). By the definition of the restricted descendancy graph of
', we have that
–1(u) and
–1(v) are joined by an edge e in D*(
'). (Note that all edges of D*(
') are derived from a tree in
' in this way.) Furthermore, in D*(
'), there is an arc from
–1(z) to
–1(u) and an arc from
–1(z) to
–1(v). Relative to
, we call the set
|
|
–1(z), we call it a sibling edge set of
–1(z). Moreover,
–1(z) is the parent node of this edge set.
The component graph of
', denoted C(
'), has node set X and two types of edges. In particular, for two nodes a and b, there is
- an unlabeled edge joining a and b precisely if there is a tree

with b an ancestor of a and no element x in
(
)–{a,b} such that b is an ancestor of x and x is an ancestor of a; and
- an edge joining a and b labeled (
, e) precisely if, relative to a tree 
' with label
, we have that {a,b} is in a sibling edge set of
and an edge e in D*(
').
Example 6.Figure 7 shows the component graph C(
') for the collection
' of rooted fully labeled trees shown in Figure 3, where type (i) edges are represented as dashed edges and type (ii) edges are represented as solid edges. For clarity, type (ii) edges are not labeled. To illustrate, {z,y} is a type (i) edge resulting from the fact that z is an ancestor of y in the tree
2' of
'; the fact that {z,e,x} are siblings in this tree results in the three type (ii) edges {z,e}, {z,x}, and {e,x} in
. Viewing mrca
1(e,y) as {e,y}, the second edge {e,x} joining e and x results from the fact that mrca
1(e,y) and x are siblings in
1', which generates the sibling edge set {{e,x},{y,x}} in
. Thus one of the edges joining e and x is labeled (mrca
2(e,z), {e,x}), whereas the other edge joining e and x is labeled (mrca
1(x,y), {mrca
1(e,y),x}).
|
Remark
Asymptotically, the component graph
At last we describe our full modification of AncestralBuild called AncestralBuild*. Analogous to Descendant in AncestralBuild, the algorithm AncestralBuild* includes a subroutine that we call Descendant*. Intuitively, apart from using the restricted descendancy graph instead of the descendancy graph, the main difference is that all of the work in finding the arc components and deciding which edges join two different arc components at each iteration of the subroutine is now done by the component graph and its various node-induced subgraphs. In the modification, all of the edges of the component graph are initially colored blue. In association with the component graph (or any of its node induced subgraphs), a blue component is a connected component of the graph obtained when masking non-blue edges. To describe AncestralBuilds, we highlight the changes to AncestralBuild:
- In Step 1,
' is now obtained from
by adding most recent common ancestor labels.
- Step 2 is replaced with the constructions of the restricted descendancy graph D*(
') and the component graph C(
') of
'.
- The input to the subroutine Descendant is initially D*(
) and C(
'). For recursive calls to Descendant (Step 4), the input is
|
i and C(
')|(
i
X).
- Step 3 of Descendant is replaced with Step 3'.
- Otherwise,
- Delete the elements of S0 (and their incident arcs) from D*(
') and denote the resulting graph by
\
0.
- Delete the elements of S0
X (and their incident edges) from C(
') and denote the resulting graph by C(
') (
0
; X).
- For every element of S0, colour all edges in each of its sibling edge sets in C(
') (
0
X) red.
- Delete the elements of S0 (and their incident arcs) from D*(
- Find the node sets
1,
2,...,
k of the blue components of C(
')\(
0
X). The arc components of
\
0 are S1,
2,...,
k, where Si
X =
i for all i.
- Delete all red edges joining two different blue components of C(
')\(
0
X). For each sibling edge set that is deleted, delete the corresponding edge in
\
0.
Remarks
In the following remarks and in the next section, we will assume for reasons of convenience that the input to DESCENDANT* is always D*(
') and Cgraph. This is in name only. (Strictly speaking, the input of recursive calls are node induced subgraphs of these graphs.)
- At the beginning and at the end of each iteration of DESCENDANT*, the node sets of the blue components of Cgraph correspond to the node sets of the arc components of D*(
').
- Although deleting the elements in S0 and their incident edges in D*(
') has the potential to create arc components A1,A2,...,Ak, deleting the elements in S0
X in C(
') will not create any blue components. This is because the sibling edges corresponding to the elements in S0 are still colored blue in the resulting subgraph of C(
'). However, Step 3'(a)(iii) colors these sibling edges red and it is this recoloring that reestablishes the correspondence described in Step 3'(b).
- The fact that the arc components of D*(
) correspond to the blue components of C(
')\(
0
X) as stated in Step 3'(b) is established in Lemma 12.
- Referring to Step 3'(c) of DESCENDANT*, the set of red edges that are deleted is a union of sibling edge sets. Again this is established in Lemma 12.
Example 7. Consider the execution of ANCESTRALBUILDS* on the collection of rooted semilabeled trees shown in Figure 1 which in Step 1 constructs the collection
' of rooted fully-labelled trees shown in Figure 3 by adding most recent common ancestors. Refer to Figure 4 and 7, respectively, for D*(
') and Cgraph, which are constructed in Step 2.
At the first call of DESCENDANT*, Step 3'(a)(i) deletes the nodes in
|
|
Together with Proposition 4, the correctness of ANCESTRALBUILDS* is established in Appendix 1. In particular, we have the following theorem.
Theorem 8. Let
be a collection of rooted semilabeled trees. Then, applying the algorithm ANCESTRALBUILDS* to
returns either
- a rooted semilabeled tree that ancestrally displays
if
is ancestrally compatible, or
- the statement
is not ancestrally compatible otherwise.
| Using a Dynamic Data Structure to Reduce the Computation Time of Connected Components |
|---|
|
|
|---|
Different calls to the subroutine DESCENDANT* consider different restrictions of the initial component graph
As shown in Henzinger et al. (1999) for a similar problem, this problem greatly simplifies if connected components of the graph are maintained in a separate data structure handled by a dynamic connectivity algorithm. This ad hoc data structure ensures that connectivity queries on the graph (i.e., asking whether two given nodes are in the same component) and updates of the graph (here, removing an edge) are performed efficiently. For example, the dynamic algorithm of Holm et al. (1998) supports each connectivity query in O(log n/log log n) and each edge deletion update in O(log 2n), where n is the number of nodes of the graph. Many implementations of dynamic connectivity algorithms have been proposed, and we refer the reader to Zaroliagis (2002) for a recent survey and experimental comparison of their running times. The next section details how the dynamic data structure comes into play in the execution of Steps 3'(b) and 3'(c) of DESCENDANT*.
| Complexity of Ancestralbuild* |
|---|
|
|
|---|
In this section, we establish the running time of ANCESTRALBUILDS*. In addition to the implementation details given here, more concrete details can be found in Appendix 2.
For a collection
of rooted semilabeled trees, the burden of the computation in ANCESTRALBUILD*\(
) lies in the subroutine DESCENDANT* and, more particularly, in Steps 3'(b) and 3'(c) when dealing with the computations in the graph
. For the exposition that follows, let n be the number of nodes and e' be the number of edges of
.
Step 3'(b) identifies the blue components of Cgraph\(
0
X). At the beginning of a call to DESCENDANT*, there is only one blue component, and then new ones result from the deletion of nodes and the change of color of edges performed in Step 3'(a) of the ongoing call. As already remarked, these new blue components arise more precisely at Step 3'(a)(iii). To determine resulting new blue components, Step 3'(b) sends one by one deletion updates to the dynamic algorithm, for each of the edges {a,b} turned red in Step 3'(a)(iii). After each such deletion update, a connectivity query is issued to the dynamic connectivity algorithm, to check whether a and b are still in the same component. If the answer is negative, then turning red this edge resulted in the division of a blue component C in
into two blue components, Ca and Cb say, where a is in the node set of Ca and b is in the node set of Cb. Starting with a and b, the other nodes of these components are then discovered by examining (via blue edges) neighbors of nodes that are in Ca and Cb, respectively. This examination processes each new node alternatively for Ca and Cb, and halts as soon as all nodes of the smallest of the two components have been found. This small component is considered new and the other is considered to be the original component C having lost some nodes. This technique, due to Even and Shiloach (1981), guarantees that each of the e' edges in the graph belongs to a new component at most log n times over all executions of Step 3'(b). This bounds the number of times an edge is examined whilst it is blue. Moreover, the overall number of deletion updates and connectivity queries issued to the dynamic algorithm is proportional to the number of edges initially in the graph, i.e., O(e').
Step 3'(c) identifies and deletes all red edges joining two distinct blue components of
\(
0
X). This is done by examining red edges separating new (hence small) blue components from other blue components. To identify these edges, all red edges incident to a node in a new blue component are examined. Those edges that are not separating two blue components are ignored (they will be removed at a later stage); those separating two blue components are removed from Cgraph\(
0
X). Note that it is possible to distinguish between these two situations in constant time without issuing a connectivity query to the dynamic algorithm. It suffices to associate a number to each blue component and to maintain in Step 3'(b) a table indicating for each node of
the number of the blue component to which it currently belongs. Because new blue components are small (i.e., contain at most half of the nodes of the component from which they originate), each red edge is examined at most log n times before being deleted from the graph. Due to Henzinger et al. (1999), this technique leads to an overall running time of O(e'log n) to delete red edges between blue components over all executions of Step 3'(c).
Lemma 9. Let
be a collection of rooted semilabeled trees with |
(
)| = n. Performing Steps 3 < eqid8 > (b) and 3 < eqid9 > (c) over all executions of DESCENDANT* < eqid10 >
,
) during algorithm ANCESTRALBUILD*(
) costs O(elog 2n) running time, where e is the initial number of edges in D*(
').
Proof
Let e' be the initial number of edges in
. As stated above, computing the blue components of C(
') over all executions of Step 3'(b) necessitates O(e'log n) operations on the graph
plus O(e') deletion updates and connectivity queries to a dynamic algorithm maintaining connectivity between nodes in
through an ad hoc data structure. Using the dynamic algorithm of Holm et al. (1998) leads to an O(e'log 2n) running time for all executions of this step.
As also stated above, each of the e' edges of C(
') is investigated at most log n times during all executions of Step 3(c) with each such investigation being done in constant time. Thus, removing red edges between blue components globally requires O(e'log n) time. Now edges of D*(
') that correspond to these red edges of
are known immediately because of pointers that are maintained between each edge of D*(
') and its associated sibling edge set in C(
'). Thus, when all edges in a sibling edge set have been removed from
, removing the corresponding edge in D*(
') is done in constant time. As there are O(e) edges in D*(
'), this requires O(e) time over the whole execution of ANCESTRALBUILDS*.
Hence, Step 3'(b) is the most time consuming, and noting that e' = O(e) gives the stated result.
Because Steps 3'(b) and 3'(c) of the subroutine DESCENDANT* are the most time consuming steps during an execution of AncestralBuilds, Theorem 10 is an immediate consequence of Lemma 9.
Theorem 10. Let
be a collection of rooted semilabeled trees with |
(
)| = n. Then ANCESTRALBUILD*(
) runs in time
|
|
Remark
In general, ANCESTRALBUILDS* allows for the source trees to be rooted semilabeled trees of unbounded degree. However, in the special case where the source trees are all rooted binary semilabeled trees, the running time of ANCESTRALBUILDS* is O(mlog 2n). This running time is the same as the running time of the algorithm in Henzinger et al. (1999) whose source trees are all rooted binary phylogenetic trees. (The running time of this algorithm can be improved from O(mn
) as stated in Henzinger et al. [1999] to O(mlog 2n) by changing the dynamic connectivity algorithm it resorts to.) This running time is almost linear in the size of the input, which guarantees short execution times even on large data sets.
| A Comparison of Running Times for Partially Resolved Trees |
|---|
|
|
|---|
Let


u
I(Ti)d(u)2. The reason for this factor is that if
in
is quadratic in the number of siblings. Unfortunately, given our current approach, there appears to be no way to remedy this. To see this, suppose that one can always choose a linear number of such edges. We will assume that this choice is independent amongst the trees in
A rooted triple is a rooted phylogenetic tree that has two interior nodes and whose label set has size three. We denote the rooted triple
with label set {a,b,c} by ab|c if the path from a to b does not intersect the path from the root to c.
Let
= {
1,
2}, where
1 is the rooted phylogenetic tree shown in Figure 8 and
2 is a rooted triple that will be described shortly. Suppose that ANCESTRALBUILD*; is applied to
with the linearity condition described above. Let
' be the collection of rooted fully labeled trees obtained in Step 1 of ANCESTRALBUILD*. Because of our assumption, the number of pairs of elements of
|
|
|
|
The running-time dependency on partially resolved trees may possibly be inherent in any supertree method for determining compatibility of the input collection, even in the case this collection consists of just leaf-labeled trees. To make a comparison with the running time in Theorem 10, we examine what appears to be the most canonical and natural extension of the algorithm in Henzinger et al. (1999) for deciding the compatibility of a collection
To aid the running time of the algorithm in Henzinger et al. (1999), the source trees in
are encoded as a collection of rooted triples. The resulting collection is displayed by a rooted phylogenetic tree
if and only if
is displayed by
. Extending this to a collection of source trees that contain arbitrary rooted phylogenetic trees, the minimum number of rooted triples required for the encoding is O(
Ti
![]()
u
I(Ti)d(u)2) (Grünewald et al., 2005). The complexity of the resulting algorithm would be the same as the one stated in Theorem 10 for ANCESTRALBUILDS*.
| An Example on Primates |
|---|
|
|
|---|
As an application of ANCESTRALBUILD*, we now consider the phylogeny of Strepsirrhini, one of the two major groups of primates. To infer this phylogeny, we use the ability of supertree methods to indirectly combine data of different kinds and ANCESTRALBUILD* to combine a set of source trees with nested taxa. The supertree is obtained from four source trees deriving from (i) retroposon data (Roos et al., 2004); (ii) morphological data (Masters and Brothers, 2002); and (iii) molecular data (Yoder et al., 2000) (see Fig. 10).
|
The first source tree (a) has been obtained from retroposon analyses (Roos et al., 2004, Fig. 2), namely from 61 loci containing short, interspersed elements (SINEs), translated into a presence-absence pattern at orthologous loci on 21 strepsirrhine species. These data contain no homoplasy and indicate unambiguously a unique tree. However, this tree does not resolve all phylogenetic relationships of the strepsirrhines: it exhibits a trifurcation involving Galagoides,Otolemur, and Galago, as well as a second trifurcation involving the three groups of the Lemuroidea (Lepilemur, Cheirogaleidae, and the group composed of Lemuridae and Indridae).
The second source tree (b) contains 13 species of the Galagonidae family and has been inferred from craniodental morphological data (Masters and Brothers, 2002; fig. 6a). This tree resolves the first trifurcation mentioned above. As it is the strict consensus of the two most parsimonious trees, this tree also contains multifurcations. More precisely, the two observed trifurcations respectively concern the placement of two Galagoides species and of two Galago species.
The third and fourth source trees (c) and (d) have been inferred from mtDNA sequences combined from the control region homologous with the hypervariable region 1 in humans, COII and cytochrome b (Yoder et al., 2000; Figs. 2 and 3). The third tree (c) contains 18 species and subspecies of Microcebus and Eulemur. The fourth tree (d) contains 40 individuals of the Microcebus genus arranged in nine identified species.
Internal labels of trees (b), (c), and (d) correspond to those displayed in the original figures, whereas those in tree (a) were added manually to demonstrate the ability of ANCESTRALBUILD* to deal accurately with many labels, located at the same or different levels in the trees. The four detailed source trees are ancestrally compatible and Figure 11 shows the supertree resulting from the application of ANCESTRALBUILDS* to this collection. The obtained phylogeny is one of the largest produced for the strepsirrhines, spanning approximately 100 taxa on a number of taxonomic levels, from order to individuals. The source trees used in this example as well as the final supertree are available online from http://www.systematicbiology.org.
|
| Discussion |
|---|
|
|
|---|
ANCESTRALBUILDS* does not take primary data as input, but rather source trees inferred from this data with some level of confidence and through an adequate method. Thus, it is likely that the source trees considered for building a supertree will be more often compatible than, say, a set of primary character data. Nonetheless, it is likely that in many cases the source trees turn out to be incompatible. Providing a faster way than other current supertree methods to detect this incompatibility is a first goal of the algorithm presented in this paper. However, this does not mark the end of its use in the process of building a supertree. Indeed,
- ANCESTRALBUILDS* can be integrated in a general supertree method that builds a supertree by resolving incompatibilities in the source trees.
- Moreover, ANCESTRALBUILDS* as a whole can be used repeatedly to quickly identify compatible subsets of the set of source trees or parts of the source trees that are compatible. The resulting compatible subsets or parts are then combined into a supertree using AncestralBuilds.
Integration of ANCESTRALBUILD* in a General Supertree Method
As stated in the introduction, consistency is an attractive property for any supertree method. Thus, in constructing a general supertree method, deciding compatibility is an integral part of the method. Currently, it seems that the only general supertree methods for rooted semilabeled trees is given in Daniel and Semple (2005). In this paper, the authors describe a general supertree method that allows for the possibility of variants. This method, called NESTEDSUPERTREE, extends ANCESTRALBUILD, and thus ANCESTRALBUILDS*. If the source trees are compatible, then it outputs a supertree that ancestrally displays each of these trees. On the other hand, if the source trees are not compatible, then at some iteration there are no nodes that have indegree 0 and no incident edges. By making an appropriate choice of nodes to delete, NESTEDSUPERTREE, or more particularly one of its variants, resolves this and continues on, eventually returning a supertree with several desirable features including the following:
- ancestrally displaying every rooted binary semilabeled trees that is ancestrally displayed by each of the source trees;
- independent of the order in which the source trees are listed.
The progress made in this paper on the running time of ANCESTRALBUILD improves the practicality of general supertree methods for nested taxa such as NESTEDSUPERTREE.
Repeated Use of ANCESTRALBUILD* in the Production of a Supertree
Despite the exactness ANCESTRALBUILDS*, it can still be used to build a supertree from incompatible source trees. Two ways are highlighted below.
- Finding a subset of the source trees that are compatible. Given an incompatible collection
of source trees, finding a maximum-sized subset of trees in
that are compatible is an NP-hard task (Bryant, 1997). However, heuristic methods can be easily implemented: (i) rank all trees in
according to their size, or to some confidence value on the trees (e.g., bayesian posterior probabilities) or in the primary data set from which they were obtained; (ii) build a compatible collection
'
eq
in the following way: starting from the best ranked tree, consider each source tree of
successively and add it to
' if it forms a compatible collection with the trees already in
', which is checked by ANCESTRALBUILDS*. At the end of the process,
' is a subset of compatible source trees, a supertree of which is provided by the final call to ANCESTRALBUILDS*.
- Finding parts of the source trees that are compatible. Usually, source trees result from an extensive analysis of primary data and their clades are provided with associated confidence values, such as bootstrap values or bayesian posterior probabilities. As a first approximation, we may assume that these confidence values are representative in some sense of the correctness of the corresponding clades (see, e.g., Berry and Gascuel [1996] for a discussion). Thus, when source trees are incompatible, a reasonable option is to first put into question the clades of the source trees that display the least support from the data. This suggests an intuitive and simple scheme to remove conflicts from the source trees by collapsing some of the clades from consideration: Let
be the list of support values for clades of the source trees, sorted by increasing order of confidence. Note that a clade appearing in different trees with different confidence values can be accounted for by resorting to suitable weighting schemes. Collapse clades of the source trees whose support value is equal to the first value of
and remove that value from the list. Then iterate until the modified source trees are compatible. Compatibility is checked every time by using ANCESTRALBUILDS*, the final call providing a supertree from the collection of modified trees.
| Appendix 1. Correctness of Ancestralbuild* |
|---|
|
|
|---|
To ease reading in the following proofs, we view a node
of a restricted descendancy graph that is not a most recent common ancestor label as a single-element set. Theorem 8 follows from Proposition 4 and Lemma 12.
Lemma 11. Let
be a collection of rooted semilabeled trees on X, and let
' be a collection of rooted fully labeled trees that is obtained from
by adding most recent common ancestor labels. Let
1
', and suppose that a,b
(
1)
X such that a and b are not comparable in
1. If 
(
1) is an ancestor label of both a and b in
1, then there is a path in C(
') from a to b in which all nodes on this path are descendant labels of
in
1.
Proof
Without loss of generality, we may assume that
labels the root of
1. Furthermore, since for any element z
(
1)
X, there is a path in C(
') from z to any of its descendant labels in
(
1)
X, we may also assume that
(
1)
X bijectively labels the leaves of
1. This implies that
1 has no degree 2 nodes.
We prove the lemma by showing that, for any pair of elements x and y in
(
1)
X, there is a path joining this pair in C(
') with the property that all nodes on this path are in
(
1)
X. The proof is by induction on the distance d from mrca
1(x,y) to the root of
1. Suppose that the height of
1 is h. If d = h–1, then, by the definition of C(
'), x and y are joined by an edge in C(
') and so the result holds. Now assume that d = h–k, where 2
k
h, and that, for all pairs of elements, w and z say, in
(
1)
X for which the distance from mrca
1(w,z) to the root is greater than h–k, there is a path in C(
') from w to z that only uses elements in
(
1)
X.
Let
and
' be the ancestor labels of x and y in
(
1) that label the sibling nodes of the node of
1 corresponding to the most recent common ancestor of x and y. Then, by the induction assumption, there is a path in C(
') from x to an element in
using just elements in
(
1)
X and there is a path in C(
') from an element in
' to y using just elements in
(
1)
X. Furthermore, by the definition of the edge set of C(
'), there is an edge joining each element in
with each element in
' in C(
'). Hence there is a path from x to y in C(
') of the desired type. This completes the proof of the lemma.
Lemma 12. Let
be a collection of rooted semilabeled trees on X, and let
' be a collection of rooted fully labeled trees that is obtained from
by adding most recent common ancestor labels. Suppose that DESCENDANT* is applied to D*(
') and
, and that at some iteration, i say, Di and Ci are the inputted restrictions of D*(
') and
, respectively, with V(Di)
X = V(Ci) and the following properties holding:
- For all a,c
X, there is a directed path of arcs in Di from c to a with no element x
X as a non terminal node in this path if and only if there is a type (i) edge joining c and a in Ci.
- The set of type (ii) edges of Ci is the disjoint union of sibling edge sets. Furthermore, if e is an edge of DESCENDANT*(
') as a result of
1
', then e is an edge in Di if and only if each of the edges in the sibling edge set of e relative to
1 are in Ci.
- After Step 3'(a) is completed, if S1,
2,...,
k are the node sets of the arc components of Di\
0, then S1
X,
2
X,...,
k
X are the node sets of the blue components of Ci\(
0
X).
- Before Step 3'(c) is performed, an edge e = {
,
'} of Di\
0 joins two arc components if and only if, for each sibling edge set of e, each edge in this set is coloured red and joins two blue components in Ci\(
0
X) with the labels in
in one blue component and the labels in
' in the other blue component.
- After Step 3'(c) is completed, for each arc component of Di\
0 and the corresponding blue component of Ci\(
0
X), (i) and (ii) still hold.
Proof
We first prove (I). As every directed path in Di\
0 must end with an element of X, it follows that, for all i, we have that Si
X is nonempty.
Let
be the node set of a blue component of Ci\(
0
X). Let a and b be nodes in
, and suppose that a and b are joined by a blue edge. Then either
- b is an ancestor of a or a is an ancestor of b in some tree in
, or
- {a,b} is an element of a sibling edge set of an edge, e say, in D*(
').
To complete the proof, we next show that Sj
X is not the (disjoint) union of the node sets of two or more blue components of Ci\(
0
X). To see this, assume that this happens. Then, in the arc component of Di\
0 whose node set is Sj, there is a path of arcs (not necessarily a directed path) from a node c that is in the node set of one blue component of Ci\(
0
X) to a node d that is in the node set of another blue component of Ci\(
0
X). Without loss of generality, we may assume that amongst all such pairs of blue components in Ci\(
0
X) and pairs of elements of X this path is of minimum length. By minimality, apart from c and d, there are no other elements of X on this path. But then each of the non-terminal nodes on this path are elements in
(
')–X. Beacuse each of these labels across all trees in
' are distinct, we deduce that each of these labels come from the same tree,
1 say, in
'. If d is an ancestor of c or c is an ancestor of d in
1, it follows by the minimality condition that {c,d} is a blue edge in Ci\(
0
X); a contradiction. So assume that c and d are noncomparable in
1. By considering the directions of the arcs in the path of directed edges between c and d, it is easily seen that one node on this path,
say, is an ancestor label of both c and d in
1. Because the node
appears in Di\
0, it follows by the fact that V(Di)
X = V(Ci) that each of its descendant labels in
(T1)
X are in Ci\(
0
X). From Lemma 11, we now deduce that there is a path of blue edges in Ci\(
0
X) from c to d, and so c and d are in the same blue component of Ci\(
0
X). This contradiction completes the proof of (I).
To prove (II), first suppose that before Step 3'(c) is performed e = {
,
'} is an edge of Di\
0 joining two arc components. Then no parent node of any sibling edge set of e is in the node set of Di\
0 and so, if it exists, every edge in each of these sets is colored red in Ci\(
0
X). Beacuse {
,
'} is an edge of Di\
0, every element in 

' is in the node set of Ci\(
0
X) and so, by (ii), we deduce that all such edges exist. Now consider the elements in
. If
is a singleton, then, trivially, the label in
is in a single arc component. Therefore, assume that
contains two labels x and y. Then
corresponds to the most recent common ancestor of x and y in some tree in
'. It follows that x and y are in the same arc component of Di\
0. Applying the same arguments to
' and using (I), we deduce one direction of (II).
For the converse of (II), suppose that, for each sibling edge set of e, each of the edges in this set are colored red and join two blue components in Ci\(
0
X) with the labels in
in one blue component and the labels in
' in the other blue component. Then it is immediate from (I) that {
,
'} joins two arc components in Di\
0. This completes the proof of (II).
For (III), consider a component D of Di\
0 and the corresponding component, C say, of Ci\(
0
X). First assume that there is a directed path of arcs in D from c to a with no element x
X as a nonterminal node. Then, as C![]()
0, we have that c and a are elements of C. Because (i) holds for Di and Ci, it immediately follows there is an arc joining c and a in C. A similar argument shows that the converse also holds. Thus D and C satisfy (i).
Now consider (ii) in the statement of the lemma. First observe that, as the set of type (ii) edges of Ci is the disjoint union of sibling edge sets, it follows by (II) and the way in which type (ii) edges of Ci and edges of Di are deleted that the set of type (ii) edges of C is the disjoint union of sibling edge sets. Now assume that e is an edge in D with ends
and
'. Then, by the construction of D*(
'), it follows that 
' is a subset of the node set of D. As all of the elements in 
' are in X, we deduce by (I) that 
' is a subset of the node set of C. Since (ii) holds for Di and Ci, it follows that each of the edges in the sibling edge sets of e are in C. A similar argument shows that the converse also holds.
| Appendix 2. Implementation Details |
|---|
|
|
|---|
We give here a description of ANCESTRALBUILDS* that is more tuned towards implementation. Data structures maintained throughout the algorithmic process are first detailed, then a pseudo code details their precise interactions in the DESCENDANT* subroutine.
Data Structures
is a graph with node set X and whose edges can be one of two colors (blue or red). The nodes of
are stored in an adjacency list. For each node v, we store its blue neighbours (i.e., the nodes connected to v by a blue edge) in a doubly linked list B(v) and its red neighbors in a doubly linked list R(v). These neighbor relations code for the edges of
. Note that an edge {u,v} of a given color, say blue, is represented twice, because both u
B(v) and v
B(u).
D*(
') is a mixed graph with node set
(
'). For each node of D*(
'), we store its indegree, its list of out-going arcs, and the list of nodes to which it is connected by an edge. For each node, we also maintain a list of pointers to the sibling edges of which it is parent. For each sibling edge of D*(
'), we maintain a list of pointers to the corresponding (one to four) edges in
. This pointer is doubled so that these edges in
know in constant time their corresponding sibling edge in D*(
'). Thus, changing in
the color of sibling edges {u,v} of a node x
\
is done in constant time for each edge (recall that B(v), and similarily B(u), is a doubly linked list, so that removing a pointed element out of it and putting it in R(v) is done in constant time).
numC is a table indicating for each node s in
the number of its current blue component in
. Initially, all elements belong to the same component (without loss of generality, we may assume that
is initially connected).
comp is a table storing the blue components of
. Each component is coded as a doubly linked list of nodes of X and is stored in the entry corresponding to its number.
Pseudocode for the DESCENDANT* Subroutine
The heading of the subroutine is the following:
|
First note that the whole graphs D*(
|
Deleting nodes of Sn0 from D*(
Also recall that removing nodes from
at Step 3'(a)(ii) (namely, nodes in Sn0
X) does not change the set of blue components of the graph.
|
S is a table, whose ith entry, Si, is the doubly linked list of nodes of the ith arc component that have indegree 0 and no incident edge. If such a node of D*(
LnextC is the list of new blue components in
(corresponding to the new arc components in D*(
') created during the on-going execution of DESCENDANT*.
is the dynamic connectivity algorithm supporting deletion updates and connectivity queries.
Step 3'(b) identifies new blue components of
resulting from the red coloring given to some of its edges. When a former blue component is broken into two blue components, nodes of the smallest one are identified efficiently according to Even and Shiloach (1981). This smallest component receives a new number nc and its nodes are transferred from the old component to the list in Comp corresponding to this new component.
|
Because of Lemma 12(II), edges between arc components of D*(
|
Note that, in the case where
| Acknowledgments |
|---|
|
|
|---|
The authors are grateful to E. Douzery and P.-H. Fabre for indicating studies on the strepsirrhine evolution used in this paper and for verifying our findings. We thank Olaf Bininda-Emonds, Rod Page, Gabriel Valiente, and an anonymous referee for their valuable comments. We expect the idea of Bininda-Emonds regarding a reference taxonomy as additional input to play an important practical role in the use of supertree algorithms. The first author was supported by the Act. Incit. Inf.-Math.-Phys. en Biol. Mol. [ACI IMP-Bio] and the Act. Inter. Incit. Région. [BIOSTIC-LR]. The second author was supported by the New Zealand Marsden Fund and a University of Canterbury Erskine Grant. This work was done while the second author was a visiting professor at the Université of Montpellier II.
| References |
|---|
|
|
|---|
-
Aho A. V., Sagiv Y., Szymanski T. G., Ullman J. D. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. (1981) 10:405–421.[CrossRef]
Berry V., Gascuel O. On the interpretation of bootstrap trees: Appropriate threshold of clade selection and induced gain. Mol. Biol. Evol. (1996) 13:999–1011.
Bininda-Emonds O. R. P., ed. Phylogenetic supertrees: Combining information to reveal the Tree of Life (2004) Dordrecht: Kluwer.
Bininda-Emonds O. R. P., Jones K. E., Price S. A., Cardillo M., Grenyer R., Purvis A. Garbage in, garbage out. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer. Pages 267–280.
Bordewich M., Evans G., Semple C. Extending the limits of supertree methods. Ann. of Comb (2005) (in press).
Bryant D. Building trees, hunting for trees, and comparing trees: Theory and methods in phylogenetic analysis (1997) University of Canterbury. Ph.D. thesis.
Daniel P., Semple C. Supertree algorithms for nested taxa. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer. Pages 151–171.
Daniel P., Semple C. A class of general supertree methods for nested taxa. SIAM J. Discrete Math. (2005) 19:463–480.[CrossRef]
Even S., Shiloach Y. An on-line edge deletion problem. J. Assoc. Comput. Machinery (1981) 28:1–4.
Grünewald S., Steel M., Swenson M. Closure operations in phylogenetics (2005) University of Canterbury. Tech. rep.
Henzinger M. R., King V., Warnow T. Constructing a tree from homeomorphic subtrees, with applications to computational evolutionary biology. Algorithmica (1999) 24:1–13.[CrossRef][Web of Science]
Hibbett D., Nilsson R. H., Snyder M., Costanzo M. F. J., Shonfeld M. Automated phylogenetic taxonomy: An example in the homobasidiomycetes (mushroom-forming fungi). Syst. Biol. (2005) 54:660–668.
Holm J., de Lichtenberg K., Thorup M. Poly-logarithmic deterministic fully dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. (1998) Proceedings of the 30th Annual ACM Symposium on Theory of Computing. New York: ACM Press. Pages 78–89.
Llabrés M., Rocha J., Rosselló F., Valiente G. On the ancestral compatibility of two phylogenetic trees with nested taxa. (2005) Submitted.
Masters J. C., Brothers D. J. Lack of congruence between morphological and molecular data in reconstructing the phylogeny of the galagonoidae. Am. J. Phys. Anthropol. (2002) 117:79–93.[CrossRef][Web of Science][Medline]
Page R. D. M. Modified mincut supertrees. In: Second International Workshop on Algorithms in Bioinformatics—Guig R., Gusfield D., eds. (2002) New York: Springer. Pages 537–552.
Page R. D. M. Taxonomy, supertrees, and the tree of life. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer. Pages 247–265.
Page R. D. M., Valiente G. An edit script for taxonomy classifications. BMC Bioinformatics (2005) 6.
Roos C., Schmitz J., Zischler H. Primate jumping genes elucidates strepsirrhine phylogeny. Proc. Nat. Acad. Sci. (2004) 29:10650–10654.
Sanderson M. J., Baldwin B. G., Bharathan G., Campbell C. S., Ferguson D., Porter J. M., Dohlen C. V., Wojciechowski M. F., Donoghue M. J. The growth of phylogenetic information and the need for a phylogenetic database. Syst. Biol. (1993) 42:562–568.
Semple C., Steel M. A supertree method for rooted trees. Discrete Appl. Math. (2000) 105:147–158.[CrossRef]
Semple C., Steel M. Phylogenetics (2003) Oxford: Oxford University Press.
Yoder A. D., Rasoloarison R. M., Goodman S. M., Irwin J. A., Atsalis S., Ravosa M. J., Ganzhorn J. U. Remarkable species diversity in malagasy mouse lemurs (primates, microcebus). Proc. Nat. Acad. Sci. (2000) 97:11325–11330.
Zaroliagis C. D. Implementations and experimental studies of dynamic graph algorithms. In: Experimental algorithms: From algorithm design to robust and efficient software (2002) Vol. 2547. New York: Lecture Notes in Computer Science, Springer. Pages 229–278.
This article has been cited by other articles:
![]() |
V. Ranwez, V. Berry, A. Criscuolo, P.-H. Fabre, S. Guillemot, C. Scornavacca, and E. J. P. Douzery PhySIC: A Veto Supertree Method with Desirable Properties Syst Biol, October 1, 2007; 56(5): 798 - 817. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||















