Skip Navigation

Systematic Biology 2006 55(2):270-288; doi:10.1080/10635150500541649
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Berry, V.
Right arrow Articles by Semple, C.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Berry, V.
Right arrow Articles by Semple, C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2006 Society of Systematic Biologists

Fast Computation of Supertrees for Compatible Phylogenies with Nested Taxa

Edited by Olaf Emonds: Associate Editor

Vincent Berry1 and Charles Semple2

1 Département Informatique, L.I.R.M.M.–C.N.R.S. 161 rue Ada, 34392 Montpellier Cedex 5, France E-mail: vberry{at}lirmm.fr
2 Biomathematics Research Centre, Department of Mathematics and Statistics, University of Canterbury Christchurch, New Zealand E-mail: c.semple{at}math.canterbury.ac.nz


    Abstract
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
Typically, supertree methods combine a collection of source trees in which just the leaves are labeled by taxa. In such methods the resulting supertree is also leaf labeled. An underlying assumption in these methods is that across all trees in the collection, no two of the taxa are nested; for example, "buttercups" and "plants" are nested taxa. Motivated by Page, the first supertree algorithm for allowing the source trees to collectively have nested taxa is called ANCESTRAL BUILD. Here, in addition to taxa labeling the leaves, the source trees may have taxa labeling some of their interior nodes. Taxa-labeling interior nodes are at a higher taxonomic level than that of their descendants (for example, genera versus species). Analogous to the supertree method BUILD for deciding the compatibility of a collection of source trees in which just the leaves are labeled, ANCESTRAL BUILD is a polynomial-time algorithm for deciding the compatibility of a collection of source trees in which some of the interior nodes are also labeled by taxa. Although a more general method, in this paper we show that the original description of ANCESTRAL BUILD can be modified so that the running time is as fast as the current fastest running time for Build. Fast computation for deciding compatibility is essential if one is to make use of phylogenetic databases that contain thousands of trees on tens of thousands of taxa. This is particularly so as ANCESTRAL BUILD is incorporated as a basic tool inside more general supertree methods (that is, methods that always output a tree regardless of the compatibility of the source trees). We apply the method to propose a comprehensive phylogeny of the strepsirrhines, a major group of the primates.

Keywords: Compatibility; nested taxa; phylogenetics, strepsirrhine phylogeny; supertree methods

Received July 4, 2005; Revised September 9, 2005; Accepted November 4, 2005


Supertree methods are a fundamental and practical way of inferring phylogenies. Generally speaking, these methods amalgamate a collection of "source" trees on overlapping subsets of taxa into a single parent tree that contains the taxa of all of the source trees. This parent tree is called a supertree. This approach to constructing evolutionary trees is particularly appealing because it allows the inference of an evolutionary scenario from a combination of analyses differing in the set of taxa they encompass as well as in the primary data from which they were conducted (for example, molecular or morphological studies). The increasing popularity of these methods and the diversity of ways in which they can be used is highlighted in a recently published book (Bininda-Emonds, 2004).

Historically, supertree methods only make use of the leaf labels of the source trees, with the resulting supertree having the property that taxa are only attached at leaves. On this basis, supertree methods can deal with taxa at different taxonomic levels only in the case where the leaves of the source trees represent non-nested taxa (e.g., "elephant" and "mammal" should not be represented as two distinct leaves amongst the source trees as the former taxa is nested inside the latter taxa). However, in supertree studies this situation does occur, especially concerning outgroups. For example, a source tree containing strepsirrhine species could include Haplorrhini, a suborder, as an outgroup leaf, and another source tree on primates would include several haplorrhine species as leaves, whereas the latter are nested inside the former taxon. Because of the assumption that taxa of the source trees are non-nested, traditional supertree methods produce in that case an incorrect supertree. Thus, unless one resorts to preprocessing of source trees to remove nested taxa via taxonomic synonymy (Bininda-Emonds et al. 2004), traditional supertree methods have limited use in the case of nested taxa, as arises when one is to amalgamate trees from phylogenetic databases such as TreeBASE (Sanderson et al., 1993). This was recently highlighted by Page (2004). Indeed, TreeBASE incorporates trees from many different published phylogenetic studies that focus on a variety of different biological problems, and hence consider evolutionary relationships at different taxonomic levels.

As a consequence of this limitation, Page (2004) motivated the task of designing supertree methods that allow the interior nodes as well as all of the leaves to represent taxa in the source trees. In these nested-taxa source trees, an interior label corresponds to a taxon at a higher taxonomic level than any of its descendants. Note that interior labels can be either available initially in the source trees, or added to some source trees based on taxa present in other source trees at the time the collection of studied source trees is formed. The first computational problem that arises is to find a polynomial-time algorithm for deciding whether or not a collection of nested-taxa source trees is compatible and, if so, constructing an appropriate supertree.

In answer to this problem, Daniel and Semple (2004) provided an algorithm called AncestralBuild. This algorithm generalizes Build, one of the first supertree methods (Aho et al., 1981). The polynomial-time algorithm Build takes a collection Formula of rooted leaf-labeled trees as input and decides whether or not Formula is compatible, in which case it returns a leaf-labeled supertree that displaysFormula . That is, the supertree preserves all of the relative groupings of taxa present in the source trees. For a collection of nested-taxa trees, if a supertree preserves all ancestral relationships as well as all groupings of taxa, then the supertree is said to ancestrally display this collection and the collection is said to be ancestrally compatible. These concepts are formally defined in the next section. However, we make a comment here on the way in which ancestor is used in this paper: by writing that a taxon tis an ancestor of a taxon t', we mean that t is either a hypothetical ancestral taxon of t' or that t is the name of a taxonomic grouping containing t', so that node labeled t is ancestral to the node labeled t'. The algorithm AncestralBuild takes a collection Formula of nested-taxa trees as input and outputs a supertree that ancestrally displays Formula if such a supertree exists; otherwise, it states that the collection is not ancestrally compatible. Though designed to handle trees containing taxa at both internal nodes and leaves, AncestralBuild also accepts collections of source trees that have taxa only at the leaves, because leaf-labeled trees are a special case of nested-taxa trees. In that particular case, AncestralBuild decides the compatibility of the source trees in the usual sense. Consequently, it does indeed generalize Build.

AncestralBuild has the desirable property to give an exact answer in polynomial time Daniel and Semple, 2004). However, there can be three objections to its use. First, for incompatible phylogenies to be combined, an all-or-nothing algorithm, that is, just stating the incompatibility when it arises, is not desirable. Second, even for the easiest case of source trees that are all fully resolved and have taxa only at the leaves, the running time of the version of AncestralBuild stated in Daniel and Semple (2004) is O(t2n3), where t is the number of source trees and n is the number of taxa. Despite being polynomial, this running time makes AncestralBuild a reasonably slow algorithm in practice, particularly if it is to handle the thousands of trees stored in databases such as TreeBASE. Third, in the absence of internal labels in the source trees, the method will build a nonsensical supertree, not knowing for instance that the "mammal" taxon met in one tree is an ancestor of "elephant," present in another tree. We answer these three objections below.

Inferring a Supertree for Incompatible Source Trees
Primary data sequences are getting longer and phylogenetic methods more accurate, so that a reasonable number of published phylogenies are compatible. For example, Llabrés et al. (2005) found a very low rate of incompatible trees in TreeBASE. However, it is still relatively frequent that one has to deal with incompatible source trees. For example, it is well known that gene trees may conflict with species trees (due to, e.g., lateral gene transfer or hybridization), and that some trees may contain erroneous branches inferred due to noisy information for some internal edges in the primary data. In the former case, one may want to construct a supertree to identify the disagreements between the gene trees and the species trees. In such cases, an all-or-nothing algorithm may not be enough and one can understandably prefer a more general supertree method that outputs a supertree that either conflicts with or omits some information present in the source trees. However, for any general supertree method, a basic property that one would always like is that of consistency; that is, if the source trees carry no conflicting information, then the supertree returned by the method displays each of the source trees. Because the property of consistency is such a compelling property, many general supertree methods dealing with leaf-labeled trees (respectively, nested-taxa trees) are likely either to have Build (respectively, AncestralBuild) as a subroutine or to be a variant of Build (respectively, AncestralBuild). Indeed, this is already the case for some general methods: both MinCutSupertree (Semple and Steel 2000) method and its modified version (Page, 2002) are variants of Build and, more recently, Daniel and Semple (2005) describe a class of general supertree methods for nested-taxa source trees that is a variant of AncestralBuild. (This class and, more particularly, the underlying general supertree method NestedSupertree are described further in the last section.) Moreover, these all-or-nothing algorithms can be repeatedly used in simple schemes to extract compatible parts out of a collection of incompatible source trees. We highlight two examples of such schemes in the discussion part of this paper.

Reasonable Running Time
Given the amount of information in current tree databases, it is not unreasonable to try to amalgamate hundreds of trees that collectively contain thousands of taxa and, consequently, fast algorithms are essential. To deal with fully resolved (i.e., binary) leaf-labeled trees, Henzinger et al. (1999) proposed a fast implementation of Build that runs in O(mn1/2) time, where m is the total sum of the number of nodes in each of the source trees and n is the total number of taxa. (They also proposed a variant of Build running in O(m+n2log n) time, but, in the case of compatibility, it typically resulted in a supertree with edges not supported by any part of the data, an undesirable feature for a supertree algorithm.) For comparison between this running time and the O(t2n3) running time of AncestralBuild, note that m can range from O(n) to O(tn). Thus, the running time of AncestralBuild as stated in Daniel and Semple (2004) is still relatively slow compared to the implementation provided by Henzinger et al. (1999) for BUILD in the special case of fully resolved leaf-labeled trees. However, comparing the gap between these running times provides hope for improving AncestralBuild on this aspect, despite the latter being a much more general algorithm, for it allows nested-taxa trees and trees that are partially resolved. In this paper we describe several modifications to AncestralBuild that greatly reduce its computational complexity. Some of these modifications have a similar flavor to techniques used in (Henzinger et al. 1999). The running time of AncestralBuild resulting from these modifications is as fast as the current fastest algorithm for source trees in which only the leaves are labeled Henzinger et al. 1999). More particularly, the achieved running time is almost linear in the size of the source trees, when these trees are fully resolved. Furthermore, extending the implementation of Henzinger et al. (1999) for Build in the most canonical and efficient way to allow partially resolved leaf-labeled trees in its input, we also show that the running time of our modification of AncestralBuild is the same as that for this extension of Build.

Having a Sensible Supertree as Output
In the case where nested taxa are present in different source trees but these trees contain no internal label, then, unless extra information is provided to recognize such nestings, there is no chance that AncestralBuild (or any supertree method) produces a sensible supertree. For example, if one tree contains "elephant" and another "mammal" without any tree indicating that the latter is an ancestor of the former, then the supertree output by the methods will have both taxa as tips. This situation arises because of lack of information in the source trees. Unfortunately, this will be a frequent situation in practice because already inferred trees are usually stored without internal labels, though this is allowed by the Newick format. In such a case, all that is required for the method to produce a sensible supertree is a (partial) reference taxonomy indicating the nestings between taxa of interest. This taxonomy can be included as one extra source tree or be divided into several extra very simple trees of two vertices, where an ancestor taxon is above one of its descendant taxon. This reference taxonomy can be either built by hand, or generated automatically by a program searching taxon names in Web-accessible taxonomies, like NCBI's Taxonomy Browser (http://www.ncbi.nlm.nih.gov/entrez) or the Tree of Life website (http://tolweb.org/tree). See Page and Valiente (2005) and Hibbett et al. (2005) for programs filling aims close to that one.

To describe an application of AncestralBuild, we studied the evolutionary relationships of the Strepsirrhini, one of the two major primate groups. Monophyly of this group is widely accepted but some intrasubordinal relationships were still recently under debate. Recent studies investigating strepsirrhine phylogeny involve quite different kinds of data, including mtDNA sequences (Yoder et al., 2000), craniodental morphological data (Masters and Brothers, 2002), and retroposon analysis (Roos et al., 2004). We applied AncestralBuild to four phylogenies published in these papers, covering many different taxonomic levels from order to individuals. The method shows that these phylogenies are ancestrally compatible and provides as a result a comprehensive supertree of the Strepsirrhini. This highlights, once again, the interest in the supertree approach, which allows us to combine phylogenies inferred from different kinds of primary data.

The rest of the paper is organized as follows. The next section contains some preliminaries that are used throughout the paper. The following three sections contain a description of ANCESTRALBUILD as stated in Daniel and Semple (2004) and descriptions of our modifications of this algorithm as well as the running time of this modification. An application of AncestralBuild to obtain a strepsirrhine supertree is described in the following section. In addition to deciding compatibility, AncestralBuild can be used in a variety of ways for constructing supertrees from incompatible source trees. We highlight some of these ways in the last section. Finally, the appendix contains the proof that establishes that our modification of AncestralBuild does indeed decide ancestral compatibility for a collection of nested-taxa trees as well as a pseudocode description of this algorithm.

We end the introduction by noting that a novel approach for testing the ancestral compatibility of two rooted semilabeled trees has been recently proposed by Llabrés et al. (2005).


    Preliminaries
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
In this section, we describe some concepts that are frequently used in the paper. For further details, we refer the interested reader to Semple and Steel (2003).

Phylogenies
The degree of a node v in a graph (or, in particular, a tree) is the number of edges incident with v. We denote the degree of v by d(v). Essentially, a rooted phylogenetic X-tree is a rooted tree whose leaves are labeled with the elements of a set X of taxa. We formally define it as follows. A rooted phylogenetic tree (on X) is an ordered pair (T;{varphi}) consisting of a rooted tree T in which all interior nodes have degree at least 3, except the root, which has degree at least 2, and a map {varphi} from X to the leaf set of T that assigns each element in X to a leaf in T so that no two elements are assigned the same leaf in T and each leaf of T is assigned an element of X. A rooted phylogenetic tree is fully resolved or binary if the root of T has degree 2 and all other interior nodes of T have degree 3.

The concept of rooted phylogenetic trees naturally extends to trees in which some of the interior nodes are also labeled by taxa. Such trees, called nested-taxa trees or rooted semilabeled trees, are used to display in the same tree taxa at different taxonomic levels. More precisely, a rooted semilabeled tree Formula on a taxa set X is an ordered pair (T;{varphi}) consisting of a rooted tree T with root node {rho}, and a map {varphi} from X into the node set V of T such that for all nonroot nodes v of degree at most 2, {varphi} assigns v an element of X, and if {rho} has degree 0 or 1, then {varphi} also assigns {rho} an element of X. If every node of T is assigned at most one taxa of X under {varphi}, then Formula is said to be singularly labeled. Furthermore, if Formula is singularly labeled and the degree of any node in T is at most 3, except for the root, which has degree at most 2, then we say that Formula is binary. In Figure 1, each of Formula 1, Formula 2, and Formula 3 is a rooted semilabeled tree, but Formula 3 is the only rooted semilabeled tree that is binary.


Figure 1
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 A compatible collection Figure 1 of rooted semilabeled trees.

 
Remarks
  1. We will often write a rooted semilabeled X-tree for a rooted semilabeled tree on X.
  2. Observe that rooted phylogenetic trees are special types of rooted semilabeled trees.
  3. To simplify matters and because we see no practical reason for nodes in the source trees to be assigned more than one taxa of X, we will assume throughout the paper that all rooted semilabeled trees that are source trees are singularly labeled. However, we note that the upgrade of the results in this paper to nonsingular rooted semilabeled trees is straightforward. Note that this remark about singular labeling does not apply to the output tree, where it is quite possible that some nodes are labeled with more than one taxa. For instance, a node joining human and chimp on a source tree that contains no other mammals could be equally labeled as anything from "hominoid" to "primate" to "mammal." This multiple listing of a node then becomes very important if all these names appear on other source trees and there is no user-input taxonomy to sort this out.

Let Formula = (T;{varphi}) be a rooted semilabeled tree on X. The set X is called the label set of Formula and we call the elements of Xlabels. We also use Formula (Formula ) to denote the label set of Formula . For example, in Figure 1, Formula (Formula 1) = {c,d,e,f,x,y}. For a node v of T, we denote the set of elements of X that are assigned to v by {varphi}–1(v) and say that the elements in {varphi}–1(v) labelv. Furthermore, Formula is fully labeled if every node of T is labeled by an element of X. For a collection Formula of rooted semilabeled trees, we denote the union of the label sets of the trees in Formula by Formula (Formula ). Thus, for example, for the collection Formula of rooted semilabeled trees shown in Figure 1, we have Formula (Formula ) = {a,b,c,d,e,f,x,y,z}.

Let Formula = (T;{varphi}) be a rooted semilabeled tree, and let a, bisinFormula (Formula ). We say that a is a descendant (label) of b (or, alternatively, b is an ancestor [label] of a) if the path from the root of T to {varphi}(a) includes {varphi}(b). Furthermore, a and b are not comparable if neither a is a descendant of b nor b is a descendant of a. Now suppose that a and b are not comparable in Formula . Then the node of T that is the last common node on the paths from the root of T to {varphi}(a) and from the root of T to {varphi}(b) is called the most recent common ancestor of a and b, and is denoted mrcaFormula (a,b).

Lastly, throughout the paper, for a rooted semilabeled tree Formula , we will use |Formula | to denote the number of nodes in Formula . Furthermore, for a collection Formula of rooted semilabeled trees, we will use m to denote {sum}Formula isin Formula |Formula |.

Compatibility
We will describe the notion of compatibility for rooted phylogenetic trees first, before extending this notion to rooted semilabeled trees.

Let Formula be a rooted phylogenetic tree on X and let Formula ' be a rooted phylogenetic tree on X', where X is a subset of X'. We say that Formula ' displaysFormula if, up to suppressing all nonroot nodes of degree 2, the minimal rooted subtree of Formula ' that connects the elements in X is a refinement of Formula , that is, Formula can be obtained from it by contracting edges. Suppressing a degree 2 node v means replacing v and its two incident edges with a single edge. Observe that the use of "refinement" means that we allow for the resolution of soft polytomies in the definition of displays, where we recall that a soft polytomy represents an uncertainty as to the exact order of speciation as opposed to a certainty of a multiple speciation event. A collection Formula of rooted phylogenetic trees are compatible if there is a rooted phylogenetic tree Formula that simultaneously displays each of the trees in Formula , in which case we say that Formula displaysFormula .

The notion of displays for rooted semilabeled trees extends the notion of displays for rooted phylogenetic trees. In particular, let Formula be a rooted semilabeled X-tree and let Formula ' be a rooted semilabeled X'-tree, where X is a subset of X'. Then Formula ' ancestrally displaysFormula if, up to suppressing nonroot nodes of degree 2, the minimal rooted subtree of Formula ' that connects the elements in X is a refinement of Formula and, for all a,bisin X, whenever a is a descendant label of b in Formula , we have that a is a descendant label of b in this rooted subtree. A collection Formula of rooted semilabeled trees is ancestrally compatible if there is a rooted semilabeled tree Formula that ancestrally displays each of the trees in Formula , in which case we say that Formula ancestrally displaysFormula . To illustrate this notion, the collection of trees in Figure 1 is ancestrally compatible, as each of its trees is ancestrally displayed by the rooted semilabeled tree shown in Figure 2.


Figure 2
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 A rooted semilabeled tree that ancestrally displays the collection Figure 2 of trees shown in Figure 1. Indeed, as we shall soon see, this is the tree outputted by AncestralBuild when applied to Figure 2.

 
It is important to note that a collection of rooted semi-labeled trees may be ancestrally incompatible because of their interior labels and not their leaf labels. This could be as simple as a contradiction on an ancestor-descendant relationship, or as complex as the incompatibility of a set of most recent common ancestor relationships.

Graphs and Digraphs
Let G be a graph with node set V. We frequently use the notation {u,v} to denote the edge joining the nodes u and v. Furthermore, the connected components of G are the subgraphs of G such that, for all u,visin V, u and v are in the same connected component if and only if there is a path from u to v.

A directed graph, also called a digraph, is simply a graph in which, instead of having edges joining two nodes, we have directed edges; that is, edges directed from one node to another. Directed edges are also called arcs. We use the notation (u,v) to denote the arc directed from u to v. For a directed graph D, the indegree of a node v is the number of arcs directed into v and the outdegree of v is the number of arcs directed out of v. Analogous to the connected components of an ordinary graph, the (connected) arc components of D are the subdigraphs of D such that, for all u,visin V, u and v are in the same arc component if and only if, ignoring the direction of the arcs, there is a path between u and v.

For the purposes of this paper, we say that a graph is mixed if it contains both arcs and edges. An arc component of a mixed graph is an arc component of the digraph obtained when masking the edges of this graph.

Let D be a (mixed) graph, let v be a node of D, and let U be a subset of the set of nodes of D. The (mixed) graph obtained from D by deleting v and each of its incident arcs and edges is denoted by Dv. In general, we use DU to denote the graph obtained from D by deleting each of the nodes in U. Furthermore, if V is the node set of D, then the restriction of D to U (also called the subgraph of D induced by U) is the subgraph of D that is obtained by deleting each of the nodes in VU. This subgraph is denoted by D|U.

Finally, one typically views a rooted tree as an undirected graph. However, it will often be convenient in this paper to view a rooted tree as a directed graph where each edge is replaced with an arc directed away from the root.


    Ancestralbuild
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
For completeness and to make a comparison of the modifications we describe in this paper, in this section we give a full description of AncestralBuild as it is stated in Daniel and Semple (2004).

We begin by describing a construction on a collection of rooted semilabeled trees, and a particular mixed graph on a collection of rooted fully labeled trees. Both the construction and mixed graph are central to AncestralBuild.

Let Formula = (T;{varphi}) be a rooted semilabeled tree on X, where T has node set V. We say that a rooted fully labeled tree Formula ' has been obtained from Formula by adding distinct new labels if, for each node of T that is not assigned a label under {varphi}, we assign it an arbitrary label not in X so that no two new labels are the same. In general, if Formula is a collection of rooted semilabeled trees, we say that Formula ' has been obtained from Formula by adding distinct new labels if it has been obtained by adding distinct new labels to each tree in Formula so that across all trees in Formula ' no two new labels are the same.

Example 1. To illustrate the above construction, let Formula be the collection of rooted semilabeled trees shown in Figure 1. The collection Formula ' shown in Figure 3 has been obtained from Formula by adding distinct new labels. Note that the added labels only need to be unique, and not as involved as the ones shown in Figure 3. The reason for choosing these particular labels will be made clear in the next section.


Figure 3
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 A collection Figure 3' of rooted fully labeled trees.

 
Let Formula ' be a collection of rooted fully labeled trees. The descendancy graph ofFormula ', denoted D(Formula '), is the mixed graph whose node set is Formula (Formula '), and whose arc and edge sets are


Formula

and


Formula

respectively. Figure 4 shows the descendancy graph corresponding to the collection Formula ' of rooted fully labeled trees shown in Figure 3.


Figure 4
View larger version (50K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 The descendancy graph of Figure 4'. Arcs are shown as dashed lines with an arrow showing the direction of the arc, whereas edges are shown as solid lines.

 
We now describe AncestralBuild. All of the work in the algorithm is performed by a subroutine called Descendant, which decides the ancestral compatibility of a collection Formula ' of rooted fully labeled trees that has been obtained from the original collection Formula of rooted semilabeled trees by adding distinct new labels. Loosely speaking, Descendant attempts to construct a rooted fully labeled tree that ancestrally displays Formula ', beginning with the cluster Formula (Formula ') and successively breaking it down into disjoint subclusters. The way in which the clusters are broken up is decided by the descendancy graph, which itself is successively broken into node-induced subgraphs. The algorithm either completes the construction of such a tree or returns not ancestrally compatible if at some iteration the associated node-induced subgraph of the descendancy graph has no nodes that have indegree zero and no incident edges.

Algorithm: ANCESTRALBUILD (Formula )
Input: A collection Formula of rooted semilabeled trees on X.
Output: A rooted semilabeled tree Formula that ancestrally displays Formula or the statement Formula is not ancestrally compatible.
  1. Construct a collection Formula ' of rooted fully labeled trees from Formula by adding distinct new labels to the unlabeled nodes in the trees of the collection.
  2. Construct the descendancy graph D(Formula ') of Formula '.
  3. Call the subroutine Descendant (D(Formula ')).
  4. If DESCENDANT returns no possible labeling, then return Formula is not ancestrally compatible. Otherwise, return the semilabeled tree Formula ' returned by DESCENDANT with the added labels removed.

Algorithm: Descendant (D(Formula '))
Input: The descendancy graph of a collection Formula ' of rooted fully labeled trees.
Output: A rooted fully labeled tree Formula ' with root node v' that ancestrally displays Formula ' or the statement no possible labeling.
  1. Let S0 denote the set of nodes of D(Formula ') that have indegree zero and no incident edges. If S0 is empty, then halt and return no possible labeling.
  2. If S0 comprises of exactly one node labeled {ell} with outdegree 0, then return the tree composed of just one leaf labeled {ell}.
  3. Otherwise,
    1. Delete the elements of S0 (and their incident arcs) from D(Formula ') and denote the resulting graph by D(Formula ')\Formula 0.
    2. Find the node sets S1, Formula 2, ..., Formula k of the arc components of D(Formula ')\Formula 0.
    3. Delete all edges of D(Formula ')\Formula 0 whose end nodes are in distinct arc components of this graph.

  4. For each element iisin{1,2,...,k}, call DESCENDANT (D(Formula ')|Formula i). If any of these calls return no possible labeling, then return this message. Otherwise, return the tree whose root node is labeled by S0 and which has Formula 1',...,Formula k' (the trees returned by the recursive calls) as child subtrees.

Remark

  1. With respect to descendancy, the added labels act as necessary "place holders" for unlabeled nodes.
  2. The recursive calls performed at Step 4 in DESCENDANT consider disjoint node induced subgraphs so that the processes applied to these subgraphs in subsequent iterations are independent from one subgraph to another.

Example 2. As an example of AncestralBuild applied to a collection of rooted semilabeled trees, consider the collection Formula of trees shown in Figure 1. Suppose that Step 1 constructs the collection Formula ' of rooted fully labeled trees shown in Figure 3. Now Step 2 builds the descendancy graph D(Formula ') as shown in Figure 4. On the first iteration of DESCENDANT, Step 1 finds mrcaFormula 1(x,y) and mrcaFormula 2(e,z) as the only nodes of D(Formula ') that have indegree zero and no incident edges. Deleting these elements in Step 3 results in the creation of two arc components, one containing nodes a, b, and x and the other containing the nine remaining nodes. Recursive calls to DESCENDANT investigate these two components separately. At Step 4 of the initial call to DESCENDANT, the subtrees returned by the two recursive calls are used as child subtrees of the root of the tree returned there. The root of this tree, which is shown in Figure 5, is labeled by S0 = {mrcaFormula 1(x,y), mrcaFormula 2(e,z)}. All added labels are eventually removed in Step 4 of AncestralBuild. The final tree returned by AncestralBuild applied to Formula is shown in Figure 2.


Figure 5
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5 The rooted semilabeled tree returned by Descendant as described in Example 2.

 
The running time of AncestralBuild as it is stated above (and thus in Daniel and Semple [2004]) is given in Proposition 3.

Proposition 3. Let Formula be a collection of rooted semilabeled trees with |Formula | = t and |Formula (Formula )| = n. Then ANCESTRALBUILD (Formula ) runs in time O(t2n3).

Proof
Recall that m = {sum}Formula isinFormula |Formula |. First note that the descendancy graph D(Formula ) contains O(m) nodes and O(tn2) arcs and edges. In the worst case, every execution of DESCENDANT removes only one node in D(Formula ), in which case the subroutine is executed O(m) times. The computation time is dominated by the cost of Steps 3(b) and 3(c) in the subroutine. Finding the connected arc components of a digraph is linear in the number of its nodes and arcs. Thus, assuming that in the worst case only a constant number of edges are removed with each node, an execution of Step 3(b) can require up to O(tn2) time to process the restriction of D(Formula ) it is considering. Because of the m executions of the subroutine, this leads to an overall running time of O(mtn2) for Step 3(b). Finding edges across different arc components in Step 3(c) can necessitate at worst to examine the O(tn2) edges of the graph. This leads to an overall running time of O(mtn2) for Step 3(c). Noting that m = O(tn) gives the final result.

Remark
It is worth noting that the running time of AncestralBuild does not improve when each of the trees in Formula are fully resolved and phylogenetic. Indeed, the descendancy graph still has O(tn2) arcs and edges in such cases.

Daniel and Semple (2004) were only interested in making sure that AncestralBuild runs in time polynomial in the size of the input. Indeed, other than providing a brief check to note that it is polynomial, no consideration to the actual running time is given. Consequently, some simple and not-so simple improvements to the algorithm were overlooked. In the next section we show how this running time can be reduced to almost linear running time. This improvement results from three changes in the algorithm: (i) drastically reducing the number of arcs and edges in the descendancy graph; (ii) using an ad hoc graph of smaller size to compute the arc components of the various restrictions of the descendancy graph and identify edges between them; (iii) working these graph restrictions as actual subgraphs of the initial graph (and not as copies of bits of it), while maintaining node-connectivity through an efficient dynamic data structure. This data structure facilitates the discovery of new arc components resulting from edge deletions.


    Improving the Running Time of Ancestralbuild
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
In this section we describe our modification of AncestralBuild.

Reducing the Size of the Descendancy Graph
We first show that a large amount of information included in the descendancy graph is redundant in the sense that the correctness of the algorithm is maintained when using a restricted version of this graph. For a collection Formula ' of rooted fully labeled trees, let D*(Formula ') be the graph having the same node set as D(Formula '), but whose arc and edge sets are


Formula

and


Formula

respectively. Clearly, the arc and edge sets of D*(Formula ') are subsets of the arc and edge sets of D*(Formula '), respectively, and so we call D*(Formula ') the restricted descendancy graph of Formula '. The restricted descendancy graph of the collection Formula ' shown in Figure 3 is shown in Figure 6.


Figure 6
View larger version (32K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 6 The restricted descendancy graph of Figure 6'. Arcs are shown as dashed lines with an arrow showing the direction of the arc, whereas edges are shown as solid lines.

 
Proposition 4. Let Formula be a collection of rooted semilabeled trees, and suppose that we apply AncestralBuild to Formula but with the restricted descendancy graph replacing the descendancy graph. Then the resulting algorithm applied to Formula returns either
  1. a rooted semi labeled tree that ancestrally displays Formula if Formula is ancestrally compatible, or
  2. the statement Formula is not ancestrally compatible otherwise.

The statement of Proposition 4 is the same statement as Theorem 4.1 (Daniel and Semple, 2004), but without the proviso on using the restricted descendancy graph. Thus, not surprisingly, the proof of Proposition 4 is very similar to their proof. Consequently, to avoid repetition, we refer to parts of the latter proof where appropriate.

Proof of Proposition 4
We first note that, as in the proof of Theorem 4.1 (Daniel and Semple, 2004), it suffices to show that the result holds when Formula is a collection of rooted fully labeled trees. The proof of (i) is the same as the proof of Theorem 4.1 (Daniel and Semple, 2004).

For the proof of (ii), suppose that AncestralBuild (using the restricted descendancy graph) outputs a rooted semilabeled tree Formula '. We show that Formula ' ancestrally displays Formula . Let Formula 1 be an element of Formula . By Lemma 2.1 (Bordewich et al., 2005), it suffices to show, for all a,bisin Formula (Formula 1) that (I) if a is a descendant label of b in Formula 1, then a is a descendant label of b in Formula ', and (II) if a and b are non-comparable in Formula 1, then a and b are noncomparable in Formula '.

The argument for (I) is very similar to the corresponding argument in the proof of Theorem 4.1(ii) (Daniel and Semple, 2004), and so we omit it. To show (II), suppose that a and b are not comparable in Formula 1. Assume that Formula 1 = (T1;{varphi}1). Let v be the node in T1 that is the most recent common ancestor of {varphi}1(a) and {varphi}1(b). By the construction of D*(Formula ), there is a pair of children, c and d say, of the label labeling v in T1 such that c and d are joined by an edge, and c is an ancestor label of a, and d is an ancestor label of b. Because we eventually output a tree, this edge is eventually deleted, but not until c and d, and hence a and b are in separate arc components of some restriction of D*(Formula ). It now follows that a and b are not comparable in Formula '.

Remark
Let Formula be a collection of rooted semilabeled trees with |Formula | = t and |Formula (Formula )| = n, and let Formula ' be a collection of fully labeled trees that is obtained from Formula by adding distinct new labels. Let m = {sum}Formula isinFormula |Formula |. Then the mixed graph D*(Formula ') contains O(m) nodes and arcs. However, the number e of edges in D*(Formula ') is a function of the degree of the nodes in the source trees. In particular, D*(Formula ') contains


Formula

edges, where I(Ti) denotes the set of interior nodes of tree Ti for all i. Note that, depending on the degree of overlap of the source trees and the degree of each of their nodes, e can range from O(m) to O(tn2). In particular, if the source trees are all fully resolved, then D*(Formula ') contains O(m) nodes, arcs, and edges.


    Computing the Arc Components Via a Smaller Graph
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
Despite the obvious improvements given by Proposition 4, it is Step 3(b) (finding the arc components) and, to a lesser extent, Step 3(c) (finding the edges joining distinct arc components) that have the biggest influence on the running time of AncestralBuild because these steps are performed a high number of times on relatively large graphs. To speed up these parts of the algorithm, we introduce an additional graph (which we call the "component graph"). The reason for this graph is that identifying the arc components of the restricted descendancy graph can be reduced to identifying the connected components in the component graph. The component graph is typically smaller than the restricted descendancy graph, and it is this smallness that provides the improvement in the running time of the algorithm. To describe the component graph, we first need to some additional concepts.

Let Formula = (T;{varphi}) be a rooted semilabeled tree on X, where T has node set V. We say that Formula ' is obtained from Formula by adding most recent common ancestor labels if, for each node z of degree at least three in which {varphi}–1(z) is empty, we assign the label mrcaFormula '(a,b) to z, where a,bisin Formula (Formula ) and z is the most recent common ancestor of {varphi}(a) and {varphi}(b). By choosing leaf labels if necessary, we can always find appropriate choices for a and b. Because each of the newly added labels are distinct, this construction is a special case of adding distinct new labels. In the paper, we often view the label mrcaFormula '(a,b) as the set {a,b} and freely move between the two viewpoints. Note that the choice of a and b need not be unique, however, which choice is made is irrelevant. In general, if Formula is a collection of rooted semilabeled trees, we say that Formula ' has been obtained from Formula by adding most recent common ancestor labels if it has been obtained by adding most recent common ancestor labels to each tree in Formula . Note that the newly added labels across Formula ' are distinct as every most recent common ancestor label refers to a particular tree in Formula .

Example 5. To illustrate the last construction, the collection Formula ' of rooted semilabeled trees shown in Figure 3 has been obtained from the collection Formula of rooted semilabeled trees shown in Figure 1 by adding most recent common ancestors labels. For instance, the label mrcaFormula 1(x,y) is assigned to the node of Formula 1 that is the most recent common ancestor of nodes labeled x and y.

Let Formula be a collection of rooted semilabeled trees on X, and let Formula ' be a collection of rooted fully labeled trees obtained from Formula by adding most recent common ancestor labels. To describe the component graph of Formula ', we simultaneously consider the restricted descendancy graph of Formula '. For a tree T and a node z in T, we say that u and v are siblings if both u and v are distinct children of z. Let Formula = (T;{varphi}) be an element of Formula ', and let u and v be siblings of a node z in T, and consider {varphi}–1(u) and {varphi}–1(v). By the definition of the restricted descendancy graph of Formula ', we have that {varphi}–1(u) and {varphi}–1(v) are joined by an edge e in D*(Formula '). (Note that all edges of D*(Formula ') are derived from a tree in Formula ' in this way.) Furthermore, in D*(Formula '), there is an arc from {varphi}–1(z) to {varphi}–1(u) and an arc from {varphi}–1(z) to {varphi}–1(v). Relative to Formula , we call the set


Formula

the sibling edge set of e or, if we are referring to {varphi}–1(z), we call it a sibling edge set of {varphi}–1(z). Moreover, {varphi}–1(z) is the parent node of this edge set.

The component graph of Formula ', denoted C(Formula '), has node set X and two types of edges. In particular, for two nodes a and b, there is

  1. an unlabeled edge joining a and b precisely if there is a tree Formula isin Formula with b an ancestor of a and no element x in Formula (Formula )–{a,b} such that b is an ancestor of x and x is an ancestor of a; and
  2. an edge joining a and b labeled ({ell}, e) precisely if, relative to a tree Formula isin Formula ' with label {ell}, we have that {a,b} is in a sibling edge set of {ell} and an edge e in D*(Formula ').
Edges in C(Formula ') arising because of (i) are called type (i) edges and edges arising because of (ii) are called type (ii) edges. Note that type (i) edges are decided by the composition of Formula , whereas type (ii) edges are decided by that of Formula '. Also, two nodes in Formula can be joined by more than one edge. However, these edges are not treated equally as there is at most one unlabeled edge and each of the labeled edges are labeled with different ordered pairs.

Example 6.Figure 7 shows the component graph C(Formula ') for the collection Formula ' of rooted fully labeled trees shown in Figure 3, where type (i) edges are represented as dashed edges and type (ii) edges are represented as solid edges. For clarity, type (ii) edges are not labeled. To illustrate, {z,y} is a type (i) edge resulting from the fact that z is an ancestor of y in the tree Formula 2' of Formula '; the fact that {z,e,x} are siblings in this tree results in the three type (ii) edges {z,e}, {z,x}, and {e,x} in Formula . Viewing mrcaFormula 1(e,y) as {e,y}, the second edge {e,x} joining e and x results from the fact that mrcaFormula 1(e,y) and x are siblings in Formula 1', which generates the sibling edge set {{e,x},{y,x}} in Formula . Thus one of the edges joining e and x is labeled (mrcaFormula 2(e,z), {e,x}), whereas the other edge joining e and x is labeled (mrcaFormula 1(x,y), {mrcaFormula 1(e,y),x}).


Figure 7
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 7 The component graph of Figure 7'.

 
Remark
Asymptotically, the component graph Formula contains the same number of edges as D*(Formula '). However, Formula contains only n nodes, whereas D*(Formula ') contains O(m) nodes. Potentially, this means that the latter gains a factor, as the size of m is O(tn). Thus, computing connected components in Formula is likely to be faster than computing them in D*(Formula '). Indeed, in the next section we show how the computation of arc components in Step 3(b) and determining which edges are to be deleted in Step 3(c) can be made faster by resorting to Formula .

At last we describe our full modification of AncestralBuild called AncestralBuild*. Analogous to Descendant in AncestralBuild, the algorithm AncestralBuild* includes a subroutine that we call Descendant*. Intuitively, apart from using the restricted descendancy graph instead of the descendancy graph, the main difference is that all of the work in finding the arc components and deciding which edges join two different arc components at each iteration of the subroutine is now done by the component graph and its various node-induced subgraphs. In the modification, all of the edges of the component graph are initially colored blue. In association with the component graph (or any of its node induced subgraphs), a blue component is a connected component of the graph obtained when masking non-blue edges. To describe AncestralBuilds, we highlight the changes to AncestralBuild:

  1. In Step 1, Formula ' is now obtained from Formula by adding most recent common ancestor labels.
  2. Step 2 is replaced with the constructions of the restricted descendancy graph D*(Formula ') and the component graph C(Formula ') of Formula '.
  3. The input to the subroutine Descendant is initially D*(Formula ) and C(Formula '). For recursive calls to Descendant (Step 4), the input is Formula |Formula i and C(Formula ')|(Formula i{cap} X).
  4. Step 3 of Descendant is replaced with Step 3'.

  1. Otherwise,
      1. Delete the elements of S0 (and their incident arcs) from D*(Formula ') and denote the resulting graph by Formula \Formula 0.
      2. Delete the elements of S0{cap} X (and their incident edges) from C(Formula ') and denote the resulting graph by C(Formula ') (Formula 0{cap}; X).
      3. For every element of S0, colour all edges in each of its sibling edge sets in C(Formula ') (Formula 0{cap} X) red.

    1. Find the node sets Formula 1,Formula 2,...,Formula k of the blue components of C(Formula ')\(Formula 0{cap} X). The arc components of Formula \Formula 0 are S1, Formula 2,..., Formula k, where Si{cap} X = Formula i for all i.
    2. Delete all red edges joining two different blue components of C(Formula ')\(Formula 0{cap} X). For each sibling edge set that is deleted, delete the corresponding edge in Formula \Formula 0.

The modified subroutine is called Descendant*.

Remarks
In the following remarks and in the next section, we will assume for reasons of convenience that the input to DESCENDANT* is always D*(Formula ') and Cgraph. This is in name only. (Strictly speaking, the input of recursive calls are node induced subgraphs of these graphs.)

  1. At the beginning and at the end of each iteration of DESCENDANT*, the node sets of the blue components of Cgraph correspond to the node sets of the arc components of D*(Formula ').
  2. Although deleting the elements in S0 and their incident edges in D*(Formula ') has the potential to create arc components A1,A2,...,Ak, deleting the elements in S0{cap} X in C(Formula ') will not create any blue components. This is because the sibling edges corresponding to the elements in S0 are still colored blue in the resulting subgraph of C(Formula '). However, Step 3'(a)(iii) colors these sibling edges red and it is this recoloring that reestablishes the correspondence described in Step 3'(b).
  3. The fact that the arc components of D*(Formula ) correspond to the blue components of C(Formula ')\(Formula 0{cap} X) as stated in Step 3'(b) is established in Lemma 12.
  4. Referring to Step 3'(c) of DESCENDANT*, the set of red edges that are deleted is a union of sibling edge sets. Again this is established in Lemma 12.

Example 7. Consider the execution of ANCESTRALBUILDS* on the collection of rooted semilabeled trees shown in Figure 1 which in Step 1 constructs the collection Formula ' of rooted fully-labelled trees shown in Figure 3 by adding most recent common ancestors. Refer to Figure 4 and 7, respectively, for D*(Formula ') and Cgraph, which are constructed in Step 2.

At the first call of DESCENDANT*, Step 3'(a)(i) deletes the nodes in


Formula

from D*(Formula '). These nodes are not in the original taxa set X, thus Step 3'(a)(ii) has nothing to delete from Cgraph. Step 3'(a)(iii) colors the edges in the sibling edge sets of mrcaFormula 1(x,y) and mrcaFormula 2(e,z) red in Cgraph; that is, the edges in {{e,x},{y,x}} and {{z,e},{z,x},{e,x}}, respectively, where we recall that the edge {e,x} in each of these sets is treated differently; one is labeled (mrcaFormula 1(x,y), {mrcaFormula 1(e,y),x}) and the other is labeled (mrcaFormula 2(e,z), {e,x}) in C(Formula '). As a result, Step 3'(b) identifies two blue components, Formula 1 and Formula 2 say, with a, b, and x in Formula 1 and the remaining nodes in Formula 2. The corresponding arc components of Formula \Formula 0 are the same as the ones found in Example during the first call to DESCENDANT. Step 3'(c) removes from Cgraph the red edges between Formula 1 and Formula 2; that is, the edges in {{e,x},{y,x},{z,x},{e,x}}, leaving {z,e} as the only red edge of the graph at this point. The rooted semilabeled tree that is eventually returned at the end of recursive calls is the tree shown in Figure 2. The fact that this is the same tree as the one outputted by AncestralBuild applied to Formula is no coincidence (see Theorem 8).

Together with Proposition 4, the correctness of ANCESTRALBUILDS* is established in Appendix 1. In particular, we have the following theorem.

Theorem 8. Let Formula be a collection of rooted semilabeled trees. Then, applying the algorithm ANCESTRALBUILDS* to Formula returns either

  1. a rooted semilabeled tree that ancestrally displays Formula if Formula is ancestrally compatible, or
  2. the statement Formula is not ancestrally compatible otherwise.
Moreover, if ANCESTRALBUILDS* returns a tree, then (up to isomorphism) the tree returned by ANCESTRALBUILDS* is the same as that returned by AncestralBuild when applied to Formula .


    Using a Dynamic Data Structure to Reduce the Computation Time of Connected Components
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
Different calls to the subroutine DESCENDANT* consider different restrictions of the initial component graph Formula . Moreover, the recursive calls issued during an on-going call of the subroutine consider node disjoint restrictions of Formula . This shows that Formula can be shared by all calls of the subroutine without the need to keep copies of parts of it. Edges are progressively removed from Formula as the different calls to to DESCENDANT* are executed. Each call has to determine the resulting blue connected components of the part of Formula it is considering.

As shown in Henzinger et al. (1999) for a similar problem, this problem greatly simplifies if connected components of the graph are maintained in a separate data structure handled by a dynamic connectivity algorithm. This ad hoc data structure ensures that connectivity queries on the graph (i.e., asking whether two given nodes are in the same component) and updates of the graph (here, removing an edge) are performed efficiently. For example, the dynamic algorithm of Holm et al. (1998) supports each connectivity query in O(log n/log log n) and each edge deletion update in O(log 2n), where n is the number of nodes of the graph. Many implementations of dynamic connectivity algorithms have been proposed, and we refer the reader to Zaroliagis (2002) for a recent survey and experimental comparison of their running times. The next section details how the dynamic data structure comes into play in the execution of Steps 3'(b) and 3'(c) of DESCENDANT*.


    Complexity of Ancestralbuild*
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
In this section, we establish the running time of ANCESTRALBUILDS*. In addition to the implementation details given here, more concrete details can be found in Appendix 2.

For a collection Formula of rooted semilabeled trees, the burden of the computation in ANCESTRALBUILD*\(Formula ) lies in the subroutine DESCENDANT* and, more particularly, in Steps 3'(b) and 3'(c) when dealing with the computations in the graph Formula . For the exposition that follows, let n be the number of nodes and e' be the number of edges of Formula .

Step 3'(b) identifies the blue components of Cgraph\(Formula 0{cap} X). At the beginning of a call to DESCENDANT*, there is only one blue component, and then new ones result from the deletion of nodes and the change of color of edges performed in Step 3'(a) of the ongoing call. As already remarked, these new blue components arise more precisely at Step 3'(a)(iii). To determine resulting new blue components, Step 3'(b) sends one by one deletion updates to the dynamic algorithm, for each of the edges {a,b} turned red in Step 3'(a)(iii). After each such deletion update, a connectivity query is issued to the dynamic connectivity algorithm, to check whether a and b are still in the same component. If the answer is negative, then turning red this edge resulted in the division of a blue component C in Formula into two blue components, Ca and Cb say, where a is in the node set of Ca and b is in the node set of Cb. Starting with a and b, the other nodes of these components are then discovered by examining (via blue edges) neighbors of nodes that are in Ca and Cb, respectively. This examination processes each new node alternatively for Ca and Cb, and halts as soon as all nodes of the smallest of the two components have been found. This small component is considered new and the other is considered to be the original component C having lost some nodes. This technique, due to Even and Shiloach (1981), guarantees that each of the e' edges in the graph belongs to a new component at most log n times over all executions of Step 3'(b). This bounds the number of times an edge is examined whilst it is blue. Moreover, the overall number of deletion updates and connectivity queries issued to the dynamic algorithm is proportional to the number of edges initially in the graph, i.e., O(e').

Step 3'(c) identifies and deletes all red edges joining two distinct blue components of Formula \(Formula 0{cap} X). This is done by examining red edges separating new (hence small) blue components from other blue components. To identify these edges, all red edges incident to a node in a new blue component are examined. Those edges that are not separating two blue components are ignored (they will be removed at a later stage); those separating two blue components are removed from Cgraph\(Formula 0{cap} X). Note that it is possible to distinguish between these two situations in constant time without issuing a connectivity query to the dynamic algorithm. It suffices to associate a number to each blue component and to maintain in Step 3'(b) a table indicating for each node of Formula the number of the blue component to which it currently belongs. Because new blue components are small (i.e., contain at most half of the nodes of the component from which they originate), each red edge is examined at most log n times before being deleted from the graph. Due to Henzinger et al. (1999), this technique leads to an overall running time of O(e'log n) to delete red edges between blue components over all executions of Step 3'(c).

Lemma 9. Let Formula be a collection of rooted semilabeled trees with |Formula (Formula )| = n. Performing Steps 3 < eqid8 > (b) and 3 < eqid9 > (c) over all executions of DESCENDANT* < eqid10 > Formula , Formula ) during algorithm ANCESTRALBUILD*(Formula ) costs O(elog 2n) running time, where e is the initial number of edges in D*(Formula ').

Proof
Let e' be the initial number of edges in Formula . As stated above, computing the blue components of C(Formula ') over all executions of Step 3'(b) necessitates O(e'log n) operations on the graph Formula plus O(e') deletion updates and connectivity queries to a dynamic algorithm maintaining connectivity between nodes in Formula through an ad hoc data structure. Using the dynamic algorithm of Holm et al. (1998) leads to an O(e'log 2n) running time for all executions of this step.

As also stated above, each of the e' edges of C(Formula ') is investigated at most log n times during all executions of Step 3(c) with each such investigation being done in constant time. Thus, removing red edges between blue components globally requires O(e'log n) time. Now edges of D*(Formula ') that correspond to these red edges of Formula are known immediately because of pointers that are maintained between each edge of D*(Formula ') and its associated sibling edge set in C(Formula '). Thus, when all edges in a sibling edge set have been removed from Formula , removing the corresponding edge in D*(Formula ') is done in constant time. As there are O(e) edges in D*(Formula '), this requires O(e) time over the whole execution of ANCESTRALBUILDS*.

Hence, Step 3'(b) is the most time consuming, and noting that e' = O(e) gives the stated result.

Because Steps 3'(b) and 3'(c) of the subroutine DESCENDANT* are the most time consuming steps during an execution of AncestralBuilds, Theorem 10 is an immediate consequence of Lemma 9.

Theorem 10. Let Formula be a collection of rooted semilabeled trees with |Formula (Formula )| = n. Then ANCESTRALBUILD*(Formula ) runs in time


Formula

where I(Ti) denotes the set of interior nodes of Ti for all i.

Remark
In general, ANCESTRALBUILDS* allows for the source trees to be rooted semilabeled trees of unbounded degree. However, in the special case where the source trees are all rooted binary semilabeled trees, the running time of ANCESTRALBUILDS* is O(mlog 2n). This running time is the same as the running time of the algorithm in Henzinger et al. (1999) whose source trees are all rooted binary phylogenetic trees. (The running time of this algorithm can be improved from O(mnFormula ) as stated in Henzinger et al. [1999] to O(mlog 2n) by changing the dynamic connectivity algorithm it resorts to.) This running time is almost linear in the size of the input, which guarantees short execution times even on large data sets.


    A Comparison of Running Times for Partially Resolved Trees
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
Let Formula be a collection of rooted semilabeled trees. Ideally, one would like the running time of an algorithm that determines the compatibility of Formula to not depend on whether or not Formula contains partially resolved trees. The method ANCESTRALBUILD* has this dependency; the running time in Theorem 10 includes the factor {sum}Formula iisinFormula {sum}uisin I(Ti)d(u)2. The reason for this factor is that if Formula ' is a collection of rooted fully labeled trees obtained from Formula by adding most recent common ancestor labels, then, for each tree Formula ' in Formula ' and each label {ell} in Formula (Formula '), the number of edges in the descendancy graph of Formula ' joining pairs of siblings of {ell} is quadratic in the number of siblings. Unfortunately, given our current approach, there appears to be no way to remedy this. To see this, suppose that one can always choose a linear number of such edges. We will assume that this choice is independent amongst the trees in Formula '. In the consideration of running times, this assumption is reasonable, for otherwise, one has to make O(t2) comparisons amongst the trees in Formula ', where |Formula (Formula )| = t. We next describe a collection of rooted semilabeled trees such that using only a linear number of edges in D*(Formula ') leads AncestralBuild* to incorrectly return a tree when applied to this collection.

A rooted triple is a rooted phylogenetic tree that has two interior nodes and whose label set has size three. We denote the rooted triple Formula with label set {a,b,c} by ab|c if the path from a to b does not intersect the path from the root to c.

Let Formula = {Formula 1,Formula 2}, where Formula 1 is the rooted phylogenetic tree shown in Figure 8 and Formula 2 is a rooted triple that will be described shortly. Suppose that ANCESTRALBUILD*; is applied to Formula with the linearity condition described above. Let Formula ' be the collection of rooted fully labeled trees obtained in Step 1 of ANCESTRALBUILD*. Because of our assumption, the number of pairs of elements of


Formula

that are joined by edges in the descendancy graph D*(Formula ') of Formula ' is linear in n. This implies that there is a pair, {a11,a12} and {a21,a22} say, not joined by an edge. Now set Formula 2 to be the rooted triple a12a21|a22. Clearly, Formula 1 and Formula 2 are not compatible, yet the rooted semilabeled tree shown in Figure 9 is returned by this application of ANCESTRALBUILD*.


Figure 8
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 8 The rooted phylogenetic tree Figure 81.

 


Figure 9
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 9 The tree outputted by ANCESTRALBUILD* when applied to {Figure 91,Figure 92} with a particular linearity condition.

 
The running-time dependency on partially resolved trees may possibly be inherent in any supertree method for determining compatibility of the input collection, even in the case this collection consists of just leaf-labeled trees. To make a comparison with the running time in Theorem 10, we examine what appears to be the most canonical and natural extension of the algorithm in Henzinger et al. (1999) for deciding the compatibility of a collection Formula of rooted binary phylogenetic trees to deciding the compatibility of a collection of arbitrary rooted phylogenetic trees.

To aid the running time of the algorithm in Henzinger et al. (1999), the source trees in Formula are encoded as a collection of rooted triples. The resulting collection is displayed by a rooted phylogenetic tree Formula if and only if Formula is displayed by Formula . Extending this to a collection of source trees that contain arbitrary rooted phylogenetic trees, the minimum number of rooted triples required for the encoding is O({sum}TiisinFormula {sum}uisin I(Ti)d(u)2) (Grünewald et al., 2005). The complexity of the resulting algorithm would be the same as the one stated in Theorem 10 for ANCESTRALBUILDS*.


    An Example on Primates
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
As an application of ANCESTRALBUILD*, we now consider the phylogeny of Strepsirrhini, one of the two major groups of primates. To infer this phylogeny, we use the ability of supertree methods to indirectly combine data of different kinds and ANCESTRALBUILD* to combine a set of source trees with nested taxa. The supertree is obtained from four source trees deriving from (i) retroposon data (Roos et al., 2004); (ii) morphological data (Masters and Brothers, 2002); and (iii) molecular data (Yoder et al., 2000) (see Fig. 10).


Figure 10
View larger version (256K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 10 Four source trees of Strepsirrhini, one of the two major groups of primates. These source trees are derived from retroposon, morphological, and molecular data.

 
The first source tree (a) has been obtained from retroposon analyses (Roos et al., 2004, Fig. 2), namely from 61 loci containing short, interspersed elements (SINEs), translated into a presence-absence pattern at orthologous loci on 21 strepsirrhine species. These data contain no homoplasy and indicate unambiguously a unique tree. However, this tree does not resolve all phylogenetic relationships of the strepsirrhines: it exhibits a trifurcation involving Galagoides,Otolemur, and Galago, as well as a second trifurcation involving the three groups of the Lemuroidea (Lepilemur, Cheirogaleidae, and the group composed of Lemuridae and Indridae).

The second source tree (b) contains 13 species of the Galagonidae family and has been inferred from craniodental morphological data (Masters and Brothers, 2002; fig. 6a). This tree resolves the first trifurcation mentioned above. As it is the strict consensus of the two most parsimonious trees, this tree also contains multifurcations. More precisely, the two observed trifurcations respectively concern the placement of two Galagoides species and of two Galago species.

The third and fourth source trees (c) and (d) have been inferred from mtDNA sequences combined from the control region homologous with the hypervariable region 1 in humans, COII and cytochrome b (Yoder et al., 2000; Figs. 2 and 3). The third tree (c) contains 18 species and subspecies of Microcebus and Eulemur. The fourth tree (d) contains 40 individuals of the Microcebus genus arranged in nine identified species.

Internal labels of trees (b), (c), and (d) correspond to those displayed in the original figures, whereas those in tree (a) were added manually to demonstrate the ability of ANCESTRALBUILD* to deal accurately with many labels, located at the same or different levels in the trees. The four detailed source trees are ancestrally compatible and Figure 11 shows the supertree resulting from the application of ANCESTRALBUILDS* to this collection. The obtained phylogeny is one of the largest produced for the strepsirrhines, spanning approximately 100 taxa on a number of taxonomic levels, from order to individuals. The source trees used in this example as well as the final supertree are available online from http://www.systematicbiology.org.


Figure 11
View larger version (184K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 11 Supertree of the strepsirrhines resulting from four compatible source trees, inferred from retroposon, morphological, and molecular data. This phylogeny demonstrates the ability of ANCESTRALBUILDS* to deal with rooted semilabeled trees spanning many different taxonomic levels, from individuals (e.g., 66-Ankarafantsika) to order (i.e., Primates).

 

    Discussion
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
ANCESTRALBUILDS* does not take primary data as input, but rather source trees inferred from this data with some level of confidence and through an adequate method. Thus, it is likely that the source trees considered for building a supertree will be more often compatible than, say, a set of primary character data. Nonetheless, it is likely that in many cases the source trees turn out to be incompatible. Providing a faster way than other current supertree methods to detect this incompatibility is a first goal of the algorithm presented in this paper. However, this does not mark the end of its use in the process of building a supertree. Indeed,
  1. ANCESTRALBUILDS* can be integrated in a general supertree method that builds a supertree by resolving incompatibilities in the source trees.
  2. Moreover, ANCESTRALBUILDS* as a whole can be used repeatedly to quickly identify compatible subsets of the set of source trees or parts of the source trees that are compatible. The resulting compatible subsets or parts are then combined into a supertree using AncestralBuilds.
We indicate below some hints in both these directions.

Integration of ANCESTRALBUILD* in a General Supertree Method
As stated in the introduction, consistency is an attractive property for any supertree method. Thus, in constructing a general supertree method, deciding compatibility is an integral part of the method. Currently, it seems that the only general supertree methods for rooted semilabeled trees is given in Daniel and Semple (2005). In this paper, the authors describe a general supertree method that allows for the possibility of variants. This method, called NESTEDSUPERTREE, extends ANCESTRALBUILD, and thus ANCESTRALBUILDS*. If the source trees are compatible, then it outputs a supertree that ancestrally displays each of these trees. On the other hand, if the source trees are not compatible, then at some iteration there are no nodes that have indegree 0 and no incident edges. By making an appropriate choice of nodes to delete, NESTEDSUPERTREE, or more particularly one of its variants, resolves this and continues on, eventually returning a supertree with several desirable features including the following:

  1. ancestrally displaying every rooted binary semilabeled trees that is ancestrally displayed by each of the source trees;
  2. independent of the order in which the source trees are listed.
We also remark that NESTEDSUPERTREE runs in polynomial time and allows for the source trees to be weighted. Such weights, irrelevant for deciding compatiblity (and thus ignored by ANCESTRALBUILD), can really help to arbitrate the conflicts between incompatible source trees.

The progress made in this paper on the running time of ANCESTRALBUILD improves the practicality of general supertree methods for nested taxa such as NESTEDSUPERTREE.

Repeated Use of ANCESTRALBUILD* in the Production of a Supertree
Despite the exactness ANCESTRALBUILDS*, it can still be used to build a supertree from incompatible source trees. Two ways are highlighted below.

  • Finding a subset of the source trees that are compatible. Given an incompatible collection Formula of source trees, finding a maximum-sized subset of trees in Formula that are compatible is an NP-hard task (Bryant, 1997). However, heuristic methods can be easily implemented: (i) rank all trees in Formula according to their size, or to some confidence value on the trees (e.g., bayesian posterior probabilities) or in the primary data set from which they were obtained; (ii) build a compatible collection Formula 'subeqFormula in the following way: starting from the best ranked tree, consider each source tree of Formula successively and add it to Formula ' if it forms a compatible collection with the trees already in Formula ', which is checked by ANCESTRALBUILDS*. At the end of the process, Formula ' is a subset of compatible source trees, a supertree of which is provided by the final call to ANCESTRALBUILDS*.
  • Finding parts of the source trees that are compatible. Usually, source trees result from an extensive analysis of primary data and their clades are provided with associated confidence values, such as bootstrap values or bayesian posterior probabilities. As a first approximation, we may assume that these confidence values are representative in some sense of the correctness of the corresponding clades (see, e.g., Berry and Gascuel [1996] for a discussion). Thus, when source trees are incompatible, a reasonable option is to first put into question the clades of the source trees that display the least support from the data. This suggests an intuitive and simple scheme to remove conflicts from the source trees by collapsing some of the clades from consideration: Let Formula be the list of support values for clades of the source trees, sorted by increasing order of confidence. Note that a clade appearing in different trees with different confidence values can be accounted for by resorting to suitable weighting schemes. Collapse clades of the source trees whose support value is equal to the first value of Formula and remove that value from the list. Then iterate until the modified source trees are compatible. Compatibility is checked every time by using ANCESTRALBUILDS*, the final call providing a supertree from the collection of modified trees.


    Appendix 1. Correctness of Ancestralbuild*
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
To ease reading in the following proofs, we view a node {ell} of a restricted descendancy graph that is not a most recent common ancestor label as a single-element set. Theorem 8 follows from Proposition 4 and Lemma 12.

Lemma 11. Let Formula be a collection of rooted semilabeled trees on X, and let Formula ' be a collection of rooted fully labeled trees that is obtained from Formula by adding most recent common ancestor labels. Let Formula 1isin Formula ', and suppose that a,bisin Formula (Formula 1){cap} X such that a and b are not comparable in Formula 1. If {ell}isin Formula (Formula 1) is an ancestor label of both a and b in Formula 1, then there is a path in C(Formula ') from a to b in which all nodes on this path are descendant labels of {ell} in Formula 1.

Proof
Without loss of generality, we may assume that {ell} labels the root of Formula 1. Furthermore, since for any element zisin Formula (Formula 1){cap} X, there is a path in C(Formula ') from z to any of its descendant labels in Formula (Formula 1){cap} X, we may also assume that Formula (Formula 1){cap} X bijectively labels the leaves of Formula 1. This implies that Formula 1 has no degree 2 nodes.

We prove the lemma by showing that, for any pair of elements x and y in Formula (Formula 1){cap} X, there is a path joining this pair in C(Formula ') with the property that all nodes on this path are in Formula (Formula 1){cap} X. The proof is by induction on the distance d from mrcaFormula 1(x,y) to the root of Formula 1. Suppose that the height of Formula 1 is h. If d = h–1, then, by the definition of C(Formula '), x and y are joined by an edge in C(Formula ') and so the result holds. Now assume that d = hk, where 2 ≤ k ≤ h, and that, for all pairs of elements, w and z say, in Formula (Formula 1){cap} X for which the distance from mrcaFormula 1(w,z) to the root is greater than hk, there is a path in C(Formula ') from w to z that only uses elements in Formula (Formula 1){cap} X.

Let {ell} and {ell}' be the ancestor labels of x and y in Formula (Formula 1) that label the sibling nodes of the node of Formula 1 corresponding to the most recent common ancestor of x and y. Then, by the induction assumption, there is a path in C(Formula ') from x to an element in {ell} using just elements in Formula (Formula 1){cap} X and there is a path in C(Formula ') from an element in {ell}' to y using just elements in Formula (Formula 1){cap} X. Furthermore, by the definition of the edge set of C(Formula '), there is an edge joining each element in {ell} with each element in {ell}' in C(Formula '). Hence there is a path from x to y in C(Formula ') of the desired type. This completes the proof of the lemma.

Lemma 12. Let Formula be a collection of rooted semilabeled trees on X, and let Formula ' be a collection of rooted fully labeled trees that is obtained from Formula by adding most recent common ancestor labels. Suppose that DESCENDANT* is applied to D*(Formula ') and Formula , and that at some iteration, i say, Di and Ci are the inputted restrictions of D*(Formula ') and Formula , respectively, with V(Di){cap} X = V(Ci) and the following properties holding:

  1. For all a,cisin X, there is a directed path of arcs in Di from c to a with no element xisin X as a non terminal node in this path if and only if there is a type (i) edge joining c and a in Ci.
  2. The set of type (ii) edges of Ci is the disjoint union of sibling edge sets. Furthermore, if e is an edge of DESCENDANT*(Formula ') as a result of Formula 1isin Formula ', then e is an edge in Di if and only if each of the edges in the sibling edge set of e relative to Formula 1 are in Ci.
Let S0 denote the set of nodes of Di with indegree zero and no incident edges. Then, in reference to DESCENDANT*,
  1. After Step 3'(a) is completed, if S1, Formula 2,..., Formula k are the node sets of the arc components of Di\Formula 0, then S1{cap} X, Formula 2{cap} X,..., Formula k{cap} X are the node sets of the blue components of Ci\(Formula 0{cap} X).
  2. Before Step 3'(c) is performed, an edge e = {{ell},{ell}'} of Di\Formula 0 joins two arc components if and only if, for each sibling edge set of e, each edge in this set is coloured red and joins two blue components in Ci\(Formula 0{cap} X) with the labels in {ell} in one blue component and the labels in {ell}' in the other blue component.
  3. After Step 3'(c) is completed, for each arc component of Di\Formula 0 and the corresponding blue component of Ci\(Formula 0{cap} X), (i) and (ii) still hold.

Proof
We first prove (I). As every directed path in Di\Formula 0 must end with an element of X, it follows that, for all i, we have that Si{cap} X is nonempty.

Let Formula be the node set of a blue component of Ci\(Formula 0{cap} X). Let a and b be nodes in Formula , and suppose that a and b are joined by a blue edge. Then either

  1. b is an ancestor of a or a is an ancestor of b in some tree in Formula , or
  2. {a,b} is an element of a sibling edge set of an edge, e say, in D*(Formula ').
Because Di and Ci satisfy (i), it follows that in case (a) there is either a directed path from b to a or a directed path from a to b in Di\Formula 0. In both cases, a and b are in the same arc component of Di\Formula 0. If (b) holds, then, by (ii), e appears in Di\Formula 0. Furthermore, as {a,b} is blue, the parent node of {a,b} in Di is not an element of S0. It follows that, in Di\Formula 0, there is an arc from this parent node to one end of e and an arc from this parent node to the other end of e. In particular, this means that a and b are in the same arc component of Di\Formula 0. It now follows that Formula is a subset of the node set of some component, Sj say, of Di.

To complete the proof, we next show that Sj{cap} X is not the (disjoint) union of the node sets of two or more blue components of Ci\(Formula 0{cap} X). To see this, assume that this happens. Then, in the arc component of Di\Formula 0 whose node set is Sj, there is a path of arcs (not necessarily a directed path) from a node c that is in the node set of one blue component of Ci\(Formula 0{cap} X) to a node d that is in the node set of another blue component of Ci\(Formula 0{cap} X). Without loss of generality, we may assume that amongst all such pairs of blue components in Ci\(Formula 0{cap} X) and pairs of elements of X this path is of minimum length. By minimality, apart from c and d, there are no other elements of X on this path. But then each of the non-terminal nodes on this path are elements in Formula (Formula ')–X. Beacuse each of these labels across all trees in Formula ' are distinct, we deduce that each of these labels come from the same tree, Formula 1 say, in Formula '. If d is an ancestor of c or c is an ancestor of d in Formula 1, it follows by the minimality condition that {c,d} is a blue edge in Ci\(Formula 0{cap} X); a contradiction. So assume that c and d are noncomparable in Formula 1. By considering the directions of the arcs in the path of directed edges between c and d, it is easily seen that one node on this path, {ell} say, is an ancestor label of both c and d in Formula 1. Because the node {ell} appears in Di\Formula 0, it follows by the fact that V(Di){cap} X = V(Ci) that each of its descendant labels in Formula (T1){cap} X are in Ci\(Formula 0{cap} X). From Lemma 11, we now deduce that there is a path of blue edges in Ci\(Formula 0{cap} X) from c to d, and so c and d are in the same blue component of Ci\(Formula 0{cap} X). This contradiction completes the proof of (I).

To prove (II), first suppose that before Step 3'(c) is performed e = {{ell},{ell}'} is an edge of Di\Formula 0 joining two arc components. Then no parent node of any sibling edge set of e is in the node set of Di\Formula 0 and so, if it exists, every edge in each of these sets is colored red in Ci\(Formula 0{cap} X). Beacuse {{ell},{ell}'} is an edge of Di\Formula 0, every element in {ell}{cup}{ell}' is in the node set of Ci\(Formula 0{cap} X) and so, by (ii), we deduce that all such edges exist. Now consider the elements in {ell}. If {ell} is a singleton, then, trivially, the label in {ell} is in a single arc component. Therefore, assume that {ell} contains two labels x and y. Then {ell} corresponds to the most recent common ancestor of x and y in some tree in Formula '. It follows that x and y are in the same arc component of Di\Formula 0. Applying the same arguments to {ell}' and using (I), we deduce one direction of (II).

For the converse of (II), suppose that, for each sibling edge set of e, each of the edges in this set are colored red and join two blue components in Ci\(Formula 0{cap} X) with the labels in {ell} in one blue component and the labels in {ell}' in the other blue component. Then it is immediate from (I) that {{ell},{ell}'} joins two arc components in Di\Formula 0. This completes the proof of (II).

For (III), consider a component D of Di\Formula 0 and the corresponding component, C say, of Ci\(Formula 0{cap} X). First assume that there is a directed path of arcs in D from c to a with no element xisin X as a nonterminal node. Then, as CFormula Formula 0, we have that c and a are elements of C. Because (i) holds for Di and Ci, it immediately follows there is an arc joining c and a in C. A similar argument shows that the converse also holds. Thus D and C satisfy (i).

Now consider (ii) in the statement of the lemma. First observe that, as the set of type (ii) edges of Ci is the disjoint union of sibling edge sets, it follows by (II) and the way in which type (ii) edges of Ci and edges of Di are deleted that the set of type (ii) edges of C is the disjoint union of sibling edge sets. Now assume that e is an edge in D with ends {ell} and {ell}'. Then, by the construction of D*(Formula '), it follows that {ell}{cup} {ell}' is a subset of the node set of D. As all of the elements in {ell}{cup} {ell}' are in X, we deduce by (I) that {ell}{cup} {ell}' is a subset of the node set of C. Since (ii) holds for Di and Ci, it follows that each of the edges in the sibling edge sets of e are in C. A similar argument shows that the converse also holds.


    Appendix 2. Implementation Details
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
We give here a description of ANCESTRALBUILDS* that is more tuned towards implementation. Data structures maintained throughout the algorithmic process are first detailed, then a pseudo code details their precise interactions in the DESCENDANT* subroutine.

Data Structures
Formula is a graph with node set X and whose edges can be one of two colors (blue or red). The nodes of Formula are stored in an adjacency list. For each node v, we store its blue neighbours (i.e., the nodes connected to v by a blue edge) in a doubly linked list B(v) and its red neighbors in a doubly linked list R(v). These neighbor relations code for the edges of Formula . Note that an edge {u,v} of a given color, say blue, is represented twice, because both uisin B(v) and visin B(u).

D*(Formula ') is a mixed graph with node set Formula (Formula '). For each node of D*(Formula '), we store its indegree, its list of out-going arcs, and the list of nodes to which it is connected by an edge. For each node, we also maintain a list of pointers to the sibling edges of which it is parent. For each sibling edge of D*(Formula '), we maintain a list of pointers to the corresponding (one to four) edges in Formula . This pointer is doubled so that these edges in Formula know in constant time their corresponding sibling edge in D*(Formula '). Thus, changing in Formula the color of sibling edges {u,v} of a node xisin\Formula is done in constant time for each edge (recall that B(v), and similarily B(u), is a doubly linked list, so that removing a pointed element out of it and putting it in R(v) is done in constant time).

numC is a table indicating for each node s in Formula the number of its current blue component in Formula . Initially, all elements belong to the same component (without loss of generality, we may assume that Formula is initially connected).

comp is a table storing the blue components of Formula . Each component is coded as a doubly linked list of nodes of X and is stored in the entry corresponding to its number.

Pseudocode for the DESCENDANT* Subroutine
The heading of the subroutine is the following:


Figure 12
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 
First note that the whole graphs D*(Formula ') and Formula , and not part of them, are indicated as the formal parameters. Indeed, to avoid unnecessary copies of parts of these graphs, it is simpler that all calls to the subroutine access the same shared data structure. Each call to the subroutine has to work on a single arc component of D*(Formula ') whose number n0 is inputted to the subroutine.


Figure 13
View larger version (35K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 
Deleting nodes of Sn0 from D*(Formula ') can lead to other nodes of the n0th arc component to have indegree 0 (and no incident edge). However, these nodes are put aside (until the end of Step 3'(a)(iii)) in a list Sn0* and not directly in S0 to avoid confusion between the elements in that set at the beginning of the on-going call, and elements to be processed in the next recursive call for the same arc component.

Also recall that removing nodes from Formula at Step 3'(a)(ii) (namely, nodes in Sn0{cap} X) does not change the set of blue components of the graph.


Figure 14
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 
S is a table, whose ith entry, Si, is the doubly linked list of nodes of the ith arc component that have indegree 0 and no incident edge. If such a node of D*(Formula ') has a label in X, then there is a link between this node in Si and the corresponding node of Formula stored in the list Compi, which stores the nodes of the ith blue component. Thus, when a node of Formula is moved from the ith blue component to the jth one (just created because of the deletion of an edge), the corresponding node of D*(Formula ') moves from Si to Sj in constant time.

LnextC is the list of new blue components in Formula (corresponding to the new arc components in D*(Formula ') created during the on-going execution of DESCENDANT*.

Formula is the dynamic connectivity algorithm supporting deletion updates and connectivity queries.

Step 3'(b) identifies new blue components of Formula resulting from the red coloring given to some of its edges. When a former blue component is broken into two blue components, nodes of the smallest one are identified efficiently according to Even and Shiloach (1981). This smallest component receives a new number nc and its nodes are transferred from the old component to the list in Comp corresponding to this new component.


Figure 15
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 
Because of Lemma 12(II), edges between arc components of D*(Formula ') correspond exactly to red edges between the corresponding blue components of Formula . At the beginning of the current call of DESCENDANT*, there is no red edge between two blue components in Formula . Thus, such red edges only exist between new blue components created at this step, more precisely, between those stored in LCnext, and the other new (larger) blue components. This guides the code of Step 3'(c), see above.


Figure 16
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 
Note that, in the case where Formula ' is compatible, Step 4 issues a recursive call for each new component created by the on-going execution of DESCENDANT*, as well as for the component Cn0. Indeed, this latter component still contains nodes. Note that some of these nodes that were not in Sn0 at the beginning of the ongoing call can now be in this set because initial nodes of this set and some edges have been removed from this arc component.


    Acknowledgments
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 
The authors are grateful to E. Douzery and P.-H. Fabre for indicating studies on the strepsirrhine evolution used in this paper and for verifying our findings. We thank Olaf Bininda-Emonds, Rod Page, Gabriel Valiente, and an anonymous referee for their valuable comments. We expect the idea of Bininda-Emonds regarding a reference taxonomy as additional input to play an important practical role in the use of supertree algorithms. The first author was supported by the Act. Incit. Inf.-Math.-Phys. en Biol. Mol. [ACI IMP-Bio] and the Act. Inter. Incit. Région. [BIOSTIC-LR]. The second author was supported by the New Zealand Marsden Fund and a University of Canterbury Erskine Grant. This work was done while the second author was a visiting professor at the Université of Montpellier II.


    References
 Top
 Abstract
 Preliminaries
 Ancestralbuild
 Improving the Running Time...
 Computing the Arc Components...
 Using a Dynamic Data...
 Complexity of Ancestralbuild*
 A Comparison of Running...
 An Example on Primates
 Discussion
 Appendix 1. Correctness of...
 Appendix 2. Implementation...
 Acknowledgments
 References
 

    Aho A. V., Sagiv Y., Szymanski T. G., Ullman J. D. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. (1981) 10:405–421.[CrossRef]

    Berry V., Gascuel O. On the interpretation of bootstrap trees: Appropriate threshold of clade selection and induced gain. Mol. Biol. Evol. (1996) 13:999–1011.[Abstract/Free Full Text]

    Bininda-Emonds O. R. P., ed. Phylogenetic supertrees: Combining information to reveal the Tree of Life (2004) Dordrecht: Kluwer.

    Bininda-Emonds O. R. P., Jones K. E., Price S. A., Cardillo M., Grenyer R., Purvis A. Garbage in, garbage out. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer. Pages 267–280.

    Bordewich M., Evans G., Semple C. Extending the limits of supertree methods. Ann. of Comb (2005) (in press).

    Bryant D. Building trees, hunting for trees, and comparing trees: Theory and methods in phylogenetic analysis (1997) University of Canterbury. Ph.D. thesis.

    Daniel P., Semple C. Supertree algorithms for nested taxa. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer. Pages 151–171.

    Daniel P., Semple C. A class of general supertree methods for nested taxa. SIAM J. Discrete Math. (2005) 19:463–480.[CrossRef]

    Even S., Shiloach Y. An on-line edge deletion problem. J. Assoc. Comput. Machinery (1981) 28:1–4.

    Grünewald S., Steel M., Swenson M. Closure operations in phylogenetics (2005) University of Canterbury. Tech. rep.

    Henzinger M. R., King V., Warnow T. Constructing a tree from homeomorphic subtrees, with applications to computational evolutionary biology. Algorithmica (1999) 24:1–13.[CrossRef][Web of Science]

    Hibbett D., Nilsson R. H., Snyder M., Costanzo M. F. J., Shonfeld M. Automated phylogenetic taxonomy: An example in the homobasidiomycetes (mushroom-forming fungi). Syst. Biol. (2005) 54:660–668.[Free Full Text]

    Holm J., de Lichtenberg K., Thorup M. Poly-logarithmic deterministic fully dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. (1998) Proceedings of the 30th Annual ACM Symposium on Theory of Computing. New York: ACM Press. Pages 78–89.

    Llabrés M., Rocha J., Rosselló F., Valiente G. On the ancestral compatibility of two phylogenetic trees with nested taxa. (2005) Submitted.

    Masters J. C., Brothers D. J. Lack of congruence between morphological and molecular data in reconstructing the phylogeny of the galagonoidae. Am. J. Phys. Anthropol. (2002) 117:79–93.[CrossRef][Web of Science][Medline]

    Page R. D. M. Modified mincut supertrees. In: Second International Workshop on Algorithms in Bioinformatics—Guig R., Gusfield D., eds. (2002) New York: Springer. Pages 537–552.

    Page R. D. M. Taxonomy, supertrees, and the tree of life. In: Phylogenetic supertrees: Combining information to reveal the Tree of Life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer. Pages 247–265.

    Page R. D. M., Valiente G. An edit script for taxonomy classifications. BMC Bioinformatics (2005) 6.

    Roos C., Schmitz J., Zischler H. Primate jumping genes elucidates strepsirrhine phylogeny. Proc. Nat. Acad. Sci. (2004) 29:10650–10654.

    Sanderson M. J., Baldwin B. G., Bharathan G., Campbell C. S., Ferguson D., Porter J. M., Dohlen C. V., Wojciechowski M. F., Donoghue M. J. The growth of phylogenetic information and the need for a phylogenetic database. Syst. Biol. (1993) 42:562–568.[Free Full Text]

    Semple C., Steel M. A supertree method for rooted trees. Discrete Appl. Math. (2000) 105:147–158.[CrossRef]

    Semple C., Steel M. Phylogenetics (2003) Oxford: Oxford University Press.

    Yoder A. D., Rasoloarison R. M., Goodman S. M., Irwin J. A., Atsalis S., Ravosa M. J., Ganzhorn J. U. Remarkable species diversity in malagasy mouse lemurs (primates, microcebus). Proc. Nat. Acad. Sci. (2000) 97:11325–11330.[Abstract/Free Full Text]

    Zaroliagis C. D. Implementations and experimental studies of dynamic graph algorithms. In: Experimental algorithms: From algorithm design to robust and efficient software (2002) Vol. 2547. New York: Lecture Notes in Computer Science, Springer. Pages 229–278.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
V. Ranwez, V. Berry, A. Criscuolo, P.-H. Fabre, S. Guillemot, C. Scornavacca, and E. J. P. Douzery
PhySIC: A Veto Supertree Method with Desirable Properties
Syst Biol, October 1, 2007; 56(5): 798 - 817.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Berry, V.
Right arrow Articles by Semple, C.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Berry, V.
Right arrow Articles by Semple, C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?