Skip Navigation

Systematic Biology 2007 56(1):57-67; doi:10.1080/10635150601167013
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Holland, B.
Right arrow Articles by Moulton, V.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Holland, B.
Right arrow Articles by Moulton, V.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 Society of Systematic Biologists

Imputing Supertrees and Supernetworks from Quartets

Edited by Rod Page: Associate Editor

B. Holland1, Glenn Conner2, Katharina Huber3 and V. Moulton3

1 Allan Wilson Centre for Molecular Ecology and Evolution, Massey University Palmerston North, New Zealand E-mail: B.R.Holland{at}massey.ac.nz
2 Massey University Palmerston North, New Zealand
3 School of Computing Sciences, Untiversity of East Anglia Norwich, UK


    Abstract
 Top
 Abstract
 Methods
 Results
 Discussion
 Appendix
 References
 
Inferring species phylogenies is an important part of understanding molecular evolution. Even so, it is well known that an accurate phylogenetic tree reconstruction for a single gene does not always necessarily correspond to the species phylogeny. One commonly accepted strategy to cope with this problem is to sequence many genes; the way in which to analyze the resulting collection of genes is somewhat more contentious. Supermatrix and supertree methods can be used, although these can suppress conflicts arising from true differences in the gene trees caused by processes such as lineage sorting, horizontal gene transfer, or gene duplication and loss. In 2004, Huson et al. (IEEE/ACM Trans. Comput. Biol. Bioinformatics 1:151–158) presented the Z-closure method that can circumvent this problem by generating a supernetwork as opposed to a supertree. Here we present an alternative way for generating supernetworks called Q-imputation. In particular, we describe a method that uses quartet information to add missing taxa into gene trees. The resulting trees are subsequently used to generate consensus networks, networks that generalize strict and majority-rule consensus trees. Through simulations and application to real data sets, we compare Q-imputation to the matrix representation with parsimony (MRP) supertree method and Z-closure, and demonstrate that it provides a useful complementary tool.

Keywords: Consensus networks; consensus trees; genome phylogeny; phylogenetic networks; phylogenetic trees; supernetworks; supertrees

Received April 5, 2006; Revised June 14, 2006; Accepted August 25, 2006


Inferring species phylogenies is an important part of understanding molecular evolution. Even so, it is well known that an accurate phylogenetic tree reconstruction for a single gene does not necessarily correspond to the species phylogeny. This may be due, for example, to error in estimating the gene tree or to the fact that the gene tree is actually different from the species tree, due to, e.g., lineage sorting, horizontal gene transfer, or gene duplication and loss (Linder and Rieseberg, 2004). One commonly accepted strategy to cope with this problem is to sequence many genes. However, the way in which to analyze the resulting collection of genes is somewhat more contentious. In particular, two different approaches are commonly employed: concatenation and consensus. In the first case, gene sequence alignments are concatenated and a species phylogeny is estimated from the resulting alignment, whereas in the second, individual gene trees are estimated for each gene alignment and a consensus species tree is constructed.

A major problem with both of these approaches is that it might not be possible to sequence certain genes for all of the taxa being studied (perhaps because some genes are missing or because sequencing techniques are inadequate). To deal with such missing genes, supermatrix methods (in the concatenation framework) and supertree methods (in the consensus framework) are commonly applied to construct a phylogeny—see Gatesy et al. (2002), Gatesy et al. (2004), and Bininda-Emonds (2004a, 2004b) for some recent discussion on the pros and cons of these approaches. Even so, because gene trees might not agree—reflecting in some cases true differences in the evolution of the genes rather than phylogenetic error—it can be important to display the conflicts between gene trees rather than suppressing them in order to produce a single polytomous tree.

Split networks (Bandelt and Dress, 1992) provide a general framework for doing precisely this (see, e.g., Posada and Crandall, 2001; Huber and Moulton, 2005; Huson and Bryant, 2006). For example, Huson et al. (2004) present a method that takes as input a set of (partial) gene trees and produces a split network or supernetwork, which—as with supertrees—pieces together the signals present in the gene trees, and—unlike supertrees—simultaneously allows the representation of conflicting signals. Here we present an alternative method for constructing supernetworks called Q-imputation. In particular, using imputation (Little and Rubin, 1987)—a statistical technique that is commonly used in situations where there is missing data—we provide a quartet-based adaptation of the consensus network methodology described in Holland et al. (2004, 2005) that can cope with missing taxa.


    Methods
 Top
 Abstract
 Methods
 Results
 Discussion
 Appendix
 References
 
We now describe the quartet-based imputation procedure for adding taxa into gene trees that forms the basis of our new method. Note that in the phylogenetic context, imputation techniques have already been applied to complete distance matrices (Makarenkov and Lapointe, 2004).

Phylogenetic Trees and Quartets
Before beginning, we require some terminology for phylogenetic trees—see Semple and Steel (2004) for more details. A phylogenetic tree T is a leaf-labeled tree. We say that T is binary if every internal node of T has degree 3, and we denote the labels of the leaf set of T by L(T). A quartet is a phylogenetic tree that has four leaves. Note that there are four possible quartets on any set of four taxa, three of which are binary and one of which is a star.

For a phylogenetic tree T and a subset Z of L(T) we let T|Z denote the phylogenetic tree obtained by restricting T to Z and suppressing any vertices of degree 2. Moreover, we denote by Q(T) the collection of quartets induced by restricting T to all possible subsets of L(T) containing four elements, and for x isin L(T), we let Qx(T) denote the subset of quartets in Q(T) each of which has x as one of their labels.

Q-imputation
Let T1,T2,...,Tg be a collection of phylogenetic trees, corresponding to a given collection of gene trees. Put X = {cup}i = 1gL(Ti), that is, the union of the leaf sets of the trees T1,T2,...,Tg. For each tree Ti, 1 ≤ i ≤ g, our goal is to sequentially insert all of the taxa in XL(Ti) into Ti until we obtain a phylogenetic tree Ti* with leaf set X. Once the trees T1*,T2*,...,Tg* have all been obtained, we then apply the consensus network method to this collection to obtain a phylogenetic network (see below for details concerning consensus networks).

We now describe the algorithm—the Q-imputation algorithm—that we use to obtain the trees T1*,T2*,...,Tg* (see Fig. 1 and Fig. 2). We run through each of the trees T1...,Tg in turn (loop Steps 1–16), completing each tree Ti to the tree Ti*. In particular, we first check if the tree Ti has leaf set X, in which case we put Ti* = Ti (Step 2); otherwise, we put T' = Ti (Step 3).


Figure 1
View larger version (72K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 The Q-imputation algorithm.

 


Figure 2
View larger version (47K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 Computing the best position to insert taxon F into tree T1 in the Q-imputation algorithm (see text and Fig. 1 for details). The inset table shows the component of S1(F,p) that is derived from comparison with each of the other trees T2, T3, and T4. The two positions e1 and e6 have equally high scores (shown in bold).

 
Then for each taxon y in XL(T') (loop Steps 4–14) we choose a position in T' in which to insert y. In particular, for each position p of T' (loop Steps 6–12)—that is, either the centre of an edge in T' or an interior vertex of T'—we insert a pendant edge into T' labeled by y to obtain a tree T'(y,p) (see Fig. 2). We then choose taxon y and position p that maximizes the score


Formula 1

(1)
that is, the total number of quartets containing y induced by T'(y,p) and Tj (the score is computed in the loop Steps 8–10, and the largest score updated in Step 11). Note that unresolved quartets are also taken into account by this score function as we treat polytomies as being hard rather than soft. Moreover, we do not normalize |Qy[T'(y,p)] {cap} Qy(Tj)|, as small trees would tend to overly influence the score.

If the maximum score obtained is 0, then we have insufficient information to proceed with our approach, and so we stop and report that the algorithm is unable process the input trees (Step 13). Otherwise, for a pair y,p that maximizes S[T'(y,p)], we set T' = T'(y,p) (Step 11). Note that if there is a tie, that is, two or more pairs y,p with highest score, we break this arbitrarily. The whole process of inserting a taxon y isin XL(T') into T' is then repeated until we obtain L(T') = X, at which point we set Ti* = T' (Step 15).

We now give a rough upper bound on the time complexity of the Q-imputation algorithm. Let n = |X| and d = max i = 1g {|X|–|L(Ti)| }. The loop Steps 1–16 are executed O(g) times, the loop Steps 4–14at most O(d) times, the loop Steps 6–12 O(n) times, and the loop Steps 8–10 O(g) times. It is straightforward to perform Step 9 in O(n3) operations—essentially because the inserted leaf y is fixed leaving (Formula ) quartets to be compared. It follows that the Q-imputation algorithm is O(g2dn4). In particular, it has polynomial run time, although in practice it will probably be necessary to develop a more sophisticated version of the Q-imputation algorithm for processing large trees. Such an algorithm could, for example, take into account previously computed quartet information when computing the score function, or in Step 9 incorporate adaptations of fast algorithms for computing the quartet-distance between trees (see, e.g., Christiansen et al., 2005). Note that because the Q-imputation algorithm processes one tree at a time, it would also be a simple matter to parallelise the algorithm.

Consensus Networks
Once a collection of trees T1*,..., Tg*, each having leaf set X, has been computed using the Q-imputation algorithm we compute a consensus network for this set, as described in Holland and Moulton (2003) and Holland et al. (2005). Consensus networks generalize strict and majority-rule consensus trees by allowing the representation of conflicting information supported by the input trees that cannot be displayed in a single tree. We now briefly describe how such networks are generated.

Each edge of a phylogenetic tree naturally gives rise to a bipartition or split A|B of its leaf set since its removal results in two trees, each one labeled by the elements of one part (A or B) of the split. We say that a phylogenetic tree T displays a split if there is an edge in the tree that gives rise to the split, and we let {Sigma}(T) denote the set of splits displayed by T. Even when the split system does not correspond to a tree, i.e., it is not compatible, we can still represent it by a split network in which splits are displayed by parallel edges. These can be generated by SplitsTree4—see Huson and Bryant (2006) for more details—and by Spectronet (Huber et al., 2002).

Now, for the collection of trees T1*,..., Tg* computed by Q-imputation, the consensus network is the split network representing the split system {Sigma}Qt consisting of those splits of X that are displayed by more than a certain proportion, t, of the trees T1*,..., Tg* (note that in case t = 0 we drop the subscript in {Sigma}Qt). For example: if t = 100, then the consensus network is a strict-consensus tree because only those splits that are displayed by every tree (i.e., 100% of the trees) are displayed by the network; if t = 50, then the consensus network is the majority-rule consensus tree; and if t < 50, then the consensus network may display conflicting splits.

Some Properties of the Q-Imputation Algorithm
We now investigate some theoretical properties of Q-imputation. The following result indicates how this algorithm performs when all of the input trees are subtrees of some phylogenetic tree.

Theorem Suppose that T1,..., Tg is a collection of phylogenetic trees that are are all subtress of some phylogenetic tree T. If the Q-imputation algorithm in Figure 1 runs to completion on this collection without encountering a tie, then its output T1*,..., Tg* satisfies T1* = ... = Tg* = T.

The proof of this result is given in Appendix. It immediately follows from the theorem that if the Q-imputation algorithm is run on a collection of phylogenetic trees each of which is a subtree of some phylogenetic tree T, then {Sigma}Qt = {Sigma}(T) for all t, and so the consensus network equals T. Note that the above theorem no longer holds if the Q-imputation algorithm encounters ties.

In general, {Sigma}Qt will not be displayed by a tree, and so it is of interest to consider other properties that this collection of splits might satisfy. Some studies have looked at the performance of supertree methods on simulated and real data with respect to the presence of unsupported clades (e.g., Bininda-Emonds and Bryant, 1998; Bininda-Emonds, 2003; Wilkinson et al., 2005a)—note these were referred to as "novel clades" in Bininda-Emonds and Bryant (1998). Bininda-Emonds (2003) defines "hard conflict" as the case where a clade (equivalently split) is contradicted by every input tree. Because any split in {Sigma}Q must come from the extension of at least one of the input trees, it cannot contradict that input tree, making hard conflicts impossible.

In the context of a supernetwork construction, Huson et al. (2004) consider the weak induction property (WIP). This states that, for input trees T1,T2,...,Tg, any split S in {Sigma}Q should restrict to a split in {Sigma}(Ti) for some 1 ≤ i ≤ g. Note that splits representing hard conflicts obviously do not satisfy the WIP. By the above theorem, the WIP holds for all splits in {Sigma}Q in case T1,T2,...,Tg are all subtrees of a phylogenetic tree. However, there are examples where WIP does not hold, although in simulations we have found that very few splits are generated in general with Q-imputation that do not satisfy the WIP.

Weighting Schemes
Often the input trees T1,T2,...,Tg have edge weights corresponding to, e.g., genetic distances. In this situation it is desirable to also assign a weight to each of the splits in {Sigma}Q. These split weights will correspond to edge weights in the consensus network. Huson et al. (2004) discuss various possibilities for weighting the edges in the Z-closure network, adaptations of which we can also apply in the Q-imputation setting.

For example, if every split in {Sigma}Q satisfies the WIP, the following weighting can be computed. For each split S in {Sigma}Q, we take the average (or maximum/minimum if desired) weight of all of the splits that result when S can be restricted to give a split displayed by one of the trees T1,T2,...,Tg. Huson et al. (2004) note that these schemes have the potentially unrealistic requirement that the trees be on the same scale. To avoid this problem, they suggest using average relative weights where the contribution from each tree is scaled by the total sum of edges in that tree.

Another approach can be applied in case the input trees T1,T2,...,Tg satisfy the all-pairs property, that is, for any pair of taxa x,y in X, both x and y are contained in the leaf set of Ti for some 1 ≤ i ≤ g. In particular, we compute a distance matrix on X, in which the distance between any pair of taxa x,y is the average taken over the length of the shortest path between x and y in all of the input trees that contain both x and y. Using this distance matrix, we then compute least-squares estimates for the branch lengths in each of the trees Ti* (using, e.g., the Fitch-Margoliash method; Fitch and Margoliash, 1967) and then apply the weighted consensus network method of Holland et al. (2006) to the resulting collection of weighted trees.


    Results
 Top
 Abstract
 Methods
 Results
 Discussion
 Appendix
 References
 
In order to test the behavior and applicability of the Q-imputation method, we implemented the Q-imputation algorithm in Figure 1. A C++ program is available for download from http://www.awcmee.massey.ac.nz/ downloads (both source code and binaries). As input the program requires a text file with one tree per line in Newick format. If the algorithm runs to completion (i.e., there is sufficient overlap between the input trees) the program outputs another text file containing one completed tree per line, again in Newick format. Analysis of the resulting trees with consensus networks can be carried out using SplitsTree4 (Huson and Bryant, 2006).

Simulations
To better understand the behavior of the Q-imputation and how it compares with other methods, we carried out simulations using Q-imputation, matrix representation with parsimony (MRP) for computing supertrees (Baum, 1992; Ragan, 1992), and the Z-closure method for computing phylogenetic supernetworks (Huson et al., 2004). We chose the MRP supertree method as it is the most widely used; a comparison between MRP and other supertree methods can be found in, e.g., Wilkinson et al. (2005a).

We generated three different types of input for the three methods: (1) trees that are all displayed by an underlying species tree, to simulate a collection of gene trees with missing data but no error; (2) trees that are within the subtree-prune-regraft (SPR) neighborhood (as defined in, e.g., Swofford et al., 1996) of an underlying species tree, to simulate gene trees with both missing data and some "noise" obscuring an underlying species phylogeny; (3) unrelated trees that are generated under the Yule model (Yule, 1924), to simulate gene trees with no underlying species tree. In each simulation, three parameters were varied: (A) the species tree, which was either the completely balanced tree on 16 taxa or the completely unbalanced (caterpillar) tree on 16 taxa; (B) the number of gene trees g taking values 2, 4, 8, 16, and 32; (C) the number of taxa missing m in each tree for m taking values 1, 2, 3, 4, 5, and 6. In each case m taxa were deleted at random from each gene tree. Instances where the same taxon was missing in all gene trees were disallowed, and in such cases a new random selection of missing taxa was created. One hundred repetitions were carried out for each parameter combination. The input tree sets were generated once and used as input to MRP, Q-imputation, and Z-closure.

Each method gives rise to a split system, which can be subsequently represented by a split network. The split systems generated were (A) {Sigma}M50 and {Sigma}M100, the splits in the majority-rule consensus and strict consensus of the most parsimonious trees (MPTs) from MRP.

To generate {Sigma}M50 and {Sigma}M100, we used the hsearch function in PAUP* version 4.0b10 (Swofford, 1998) to search for MPTs and saved up to 1000 equally scoring MPTs (all other PAUP* settings were left at default values); (B) {Sigma}Q, {Sigma}Q20, and {Sigma}Q50, the split systems created using Q-imputation with thresholds 0, 20, and 50, respectively; note that {Sigma}Q50subeq {Sigma}Q20subeq {Sigma}Q, that is, the split systems are nested; (C) {Sigma}Z, the splits generated using Z-closure with 10 fixed orderings and no dimension filter. Note that {Sigma}Q50, {Sigma}M50, and {Sigma}M100 are always compatible split systems and therefore correspond to supertrees, whereas all of the other split systems that we generated may contain incompatible splits and therefore might give rise to networks.

Following Huson et al. (2004), we measured the number of false positives and false negatives produced by each method, where a false positive is any split that appears in the output split system that is not supported by any input tree (i.e., it does not restrict to a split displayed by any of the input trees), and a false negative is a split displayed by an input tree that is not the restriction of any split in the output split system. Note that any method that has the WIP property (e.g., Z-closure) cannot produce any false-positive splits. Moreover, Q-imputation with threshold value 0 cannot produce any false negatives, although it may produce false negatives when the threshold is varied.

We first consider false positives. Because Z-closure cannot generate these, we only recorded the presence of splits that do not have the WIP in {Sigma}Q, {Sigma}Q20, {Sigma}Q50, {Sigma}M100, and {Sigma}M50. We summed our results over the different species trees, the possible parameter settings for g and m, and over 100 repetitions per parameter setting, and so the results represent a sum over 6000 simulations. Complete results for each parameter setting are given in Supplementary Material (available online at http://www.systematicbiology.org). For simulation (1) false positives did not occur for either Q-imputation or MRP (as expected from the above theorem for Q-imputation). In simulation (2), false positives were generated for both MRP and Q-imputation, but only very rarely. In total, {Sigma}Q, {Sigma}Q20, and {Sigma}Q50 contained 36, 35, and 0 false positives, and {Sigma}M50 and {Sigma}M100 contained 87 and 46. In simulation (3), {Sigma}Q, {Sigma}Q20, and {Sigma}Q50 contained 56, 52, and 0 false positives in total, similar in magnitude to simulation (2), but {Sigma}M50 and {Sigma}M100 contained many more false positives (5252 and 4368). In summary, false-positive splits were most common when there are few gene trees and there were many taxa missing (see Supplementary Material).

We now consider false negatives. Results are presented in Table 1 for the numbers of false negatives that were contained in {Sigma}Q20 and {Sigma}Z (results for the other split systems are presented in Supplementary Material). In Table 1 we see that {Sigma}Q20 and {Sigma}Z exhibit quite different behaviors. For Z-closure, false negatives are most common when there are few gene trees and there are many taxa missing, presumably because there are not as many possibilities to apply the Z-closure rule. This is in agreement with a similar simulation study carried out in Huson et al. (2004) with Z-closure that indicated that the number of false negatives is reduced with increasing (i) number of gene trees, (ii) overlap between gene trees, and (iii) agreement between gene trees. In contrast, the number of false negatives contained in {Sigma}Q20 increases with increasing number of gene trees. In simulation (2), {Sigma}Q20 and {Sigma}Q50 contain fewer false negatives than in simulation (3); the splits that are contained in these split systems presumably correspond to the splits displayed by the underlying species tree, which appear more frequently in trees in the SPR neighborhood. Based on the results for simulation (3), as the number of gene trees increases Z-closure will tend to produce increasingly complex networks, whereas Q-imputation will tend to produce sparse networks that are close to the star tree.


View this table:
[in this window]
[in a new window]

 
Table 1. False negatives for simulations with (1b) all gene trees displayed by a balanced underlying species tree; (1u) all gene trees displayed by an unbalanced underlying species tree; (2b) all gene trees displayed by a tree in the SPR neighborhood of a balanced underlying species tree; (2u) all gene trees displayed by a tree in the SPR neighborhood of an unbalanced underlying species tree; (3) random Yule trees. The numbers in boldface refer to the number of gene trees (2–32) and the number of missing taxa per tree (1–6).

 
For simulations (1) and (2) we also considered an alternative definition of false negatives and false positives that takes into account that there is an underlying species tree Formula . In particular, Type I splits are those contained in the output split system that are not in {Sigma}(Formula ), and Type II those splits in {Sigma}(Formula ) that are not in the output split system. We compared the performance of MRP, Q-imputation, and Z-closure with these alternative definitions (full results given in Supplementary Material). Typically, {Sigma}Q50 contained more Type II splits and fewer Type I splits than {Sigma}M50. We also found that {Sigma}Q20 tended to contain fewer Type II splits than {Sigma}Z over almost all parameter choices. Moreover, {Sigma}Q20 tended to contain fewer Type I splits than {Sigma}Z when there were more than 4 gene trees, but more when there were only 2 gene trees. As expected, all methods generated fewer Type I splits when there were fewer missing taxa. The number of Type II splits in {Sigma}M50, {Sigma}M, {Sigma}Q20, and {Sigma}Q50 decreased with increasing number of genes, whereas with {Sigma}Z this number increased.

A Collection of Real Data Sets
We collected 14 data sets that have previously been used in the study of supertrees and supernetworks. These varied in terms of the number of source trees, number of taxa and the average coverage (i.e., the amount of missing data). The sources are given in the caption of Table 2. Table 2 is a summary of the performance of the different methods on these data sets, indicating the number of false positives and false negatives and also the dimension of the resulting split system (the dimension of a split system is the cardinality of the largest subcollection of splits in which every pair of splits is not compatible: a dimension k split system implies a consensus network containing a k-dimensional hypercube so that, in particular, a split system with dimension 1 corresponds to a tree). Note that the dimension of the {Sigma}Q50, {Sigma}M50, and {Sigma}M100 is always 1 and is thus not shown in the table.


View this table:
[in this window]
[in a new window]

 
Table 2. A summary of the performance of MRP, Q-imputation, and Z-closure on a number of data sets that have previously been studied in a supertree or supernetwork context. Data sources are indicated in parentheses after the name of the data set: (1) Bininda-Emonds et al., 1999; (2) Wilkinson et al., 2005b; (3) Huson et al., 2004; (4) Kennedy and Page, 2002; (5) Banks and Whitfield, 2006; (6) Huson et al., 2006. For each data set the number of source trees and the number of taxa are indicated. Coverage indicates the average proportion of taxa present in the source trees, the values in brackets show minimum and maximum values. The table indicates the number of false positives (FP), the number of false negatives (FN), and the dimension of the split-system (where 5+ indicates dimension 5 or higher).

 
There is a tradeoff between ease of visualization (low dimension) and the number of false negatives. False positives are rare in {Sigma}Q and are nearly eliminated if a threshold is used. The maximum possible dimension of the split system {Sigma}Q20 is 4 (by Theorem 1 of Holland and Moulton, 2003) but it is either 1 or 2 in all the example data sets we looked at, which indicates that, as expected, there is a lot of agreement in the source trees. Comparing the two supertree methods {Sigma}Q50 and {Sigma}M50, we see that in general {Sigma}Q50 gave rise to fewer false positives (although they are rare for both methods) and more false negatives. As expected from simulation results (here and Huson et al., 2004), Z-closure networks can become very complex with data sets having many trees; the four largest data sets Carnivora, Felidae, Mustelidae, and Phocidae in Table 2 all have dimension 5 or higher. In contrast, {Sigma}Q20 is compatible (and so has dimension 1) for three of these four data sets, it is for the data sets with fewer trees that it mainly has dimension 2.

Case Study: Microgastrine Wasps
We tested Q-imputation on the data set in Banks and Whitfield (2006), which comprises seven genes and a total of 45 taxa. Each gene tree is missing no more than eight taxa. In particular, the genes were elongation factor 1-alpha (missing 3 taxa), LW rhodopsin (missing 2 taxa), wingless (missing 4 taxa), cytochrome oxidase I (missing 8 taxa), 28S rRNA (missing 5 taxa), 16S rRNA (missing 0 taxa), arginine kinase (missing 5 taxa). This data set presents a tough challenge to supernetwork methods because although the underlying evolutionary history is expected to be treelike (i.e., there is no hybridisation), the microgastrine wasps have undergone a rapid radiation (Banks and Whitfield, 2006). This means that the gene trees are likely to conflict with each other due to phylogenetic error and perhaps lineage sorting. We were particularly interested in the effect of reducing Q-imputation split systems by applying the threshold value for the complexity of the resulting split network, as compared with the effect of reducing Z-closure split systems by applying a maximum dimension filter.

We applied Q-imputation to create seven trees on all taxa (no ties were encountered), and then created a consensus network displaying any split that appeared in at least two of the seven output trees; i.e., we applied threshold value t = 20 (Fig. 3). All splits generated had the WIP, allowing us to calculate average relative weights. We also computed a Z-closure supernetwork using settings of 100 random orderings, average relative weights, and a maximum dimension filter of 2 (Fig. 4).


Figure 3
View larger version (142K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 The Q-imputation supernetwork of seven gene trees for microgastrine wasps. Only splits that appear in two or more of the completed trees are displayed by the network. Edge weights are calculated using the average relative weights method of Huson et al. (2004)—see Methods for details. Taxa in capitals are part of the outgroup.

 


Figure 4
View larger version (126K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 The Z-closure supernetwork of seven gene trees for microgastrine wasps. The network was created by SplitsTree4; settings used were 100 runs (fixed orderings), a maximum dimension filter of 2, and average relative edge weights. Taxa in capitals are part of the outgroup.

 
For both methods, filtering had a large impact on the output. By only displaying splits that appear in two or more trees from the Q-imputation trees, rather than displaying all splits, the number of splits was reduced from 227 to 79, resulting in a network of dimension 2. For Z-closure the output of 387 splits was reduced to a dimension 2 split system containing 126 splits.

In the Q-imputation network (Fig. 3), there are three areas of the graph where conflicting splits are represented, but overall it is quite treelike. In comparison, the maximum dimension filter has been rather less sucessful in reducing the complexity of the Z-closure supernetwork (Fig. 4), which is still quite complex. There are several places in the Q-imputation network (Fig. 3) where the graph is not well resolved. This is expected from the work of Banks and Whitfield (2006), where it was noted how difficult it is to resolve these relationships due to a rapid radiation of wasp genera. Also note that in the Q-imputation network, the outgroup taxa appear together separated from the rest of the taxa by a single edge, but in the Z-closure network, Pseudapanteles dignus is grouped with the outgroup taxa.


    Discussion
 Top
 Abstract
 Methods
 Results
 Discussion
 Appendix
 References
 
We have presented a quartet-based method called Q-imputation for constructing supertrees and supernetworks based on the application of consensus networks. We have compared its performance to the MRP supertree and the Z-closure supernetwork methods and also tested its performance on several real data sets.

As a supertree method, we found, in both simulations and examples, that Q-imputation was more conservative than MRP in that it tended to return fewer false-positive (unsupported) splits, but also fewer supported splits (more false negatives). As a supernetwork method, we found that Q-imputation tended to give rise to false positives but not false negatives, whereas Z-closure gave rise to false negatives but no false positives in the sense of Huson et al. (2004). Also, in simulations where there was an underlying species tree, we found that for Z-closure the number of Type II splits (splits that are not in the underlying species tree) increased with increasing number of gene trees, as opposed to the split system derived from applying a threshold to the trees completed by Q-imputation where the number of Type II splits had the desirable property of decreasing with increasing number of gene trees.

Note that in practice, the number of false negatives is not necessarily a useful measure. For the output to be visually palatable, one typically needs to restrict the number of splits that are being displayed. In fact one of the main strengths of Q-imputation over Z-closure is that it provides a natural means to filter out splits. Even so, developing methods for filtering split systems is an active research area (Huson et al., 2006), and so it is possible that the problem of increasing complexity of Z-closure networks with increasing numbers of gene trees will be resolved in the near future.

There are a number of ways that Q-imputation might be modified. For example, in the Q-imputation algorithm described in Methods, we could deal with ties in a different way; if we encounter two or more equally scoring placements for a taxon, delay adding that taxon and attempt to add the next taxon in the fixed ordering. We might also consider allowing quartets created by the inclusion of taxa to influence the score (in the current implementation only the original quartets contribute towards the placement scores). However, this would have the disadvantage of making the algorithm more order dependent, as the completion of one tree would influence the completion of other trees and would also raise the computation time. In addition, in case the Q-imputation algorithm terminates with no result in the case of insufficient information (due, e.g., to low overlap between the input trees), it might be of interest to investigate if insertion of taxa could still be supported by the hull of the relevant quartets in the other trees (i.e., quartets induced by the existing quartets) and could thus still be considered for insertion.

More generally, both Q-imputation and Z-closure theoretically require a high runtime. In Q-imputation, this is basically due to the cost of checking for matching quartets, whereas in Z-closure its due to the factorial number of fixed orders of splits. To counteract this problem in Q-imputation we considered an alternative scoring function to (1) based on splits rather than the quartets. However, we did not find this method satisfactory as it was strongly biased because unbalanced splits (i.e., ones having parts with a large difference in size) are compatible with more other splits than balanced splits. Another possibility that we have not explored yet is to consider a score function based on path-metrics in the input trees. A further possible weakness of both the Q-imputation and Z-closure methods is that although the edge weights can influence the final appearance of the supernetwork they have no influence on the splits generated before filtering is applied. It might be interesting to try and develop methods that made direct use of edge weight information.

In conclusion, we have presented a new method for computing supertrees and supernetworks. Although there are many available supertree methods (Bininda-Emonds, 2004c), until now, Z-closure was the only available supernetwork method. Our results indicate that Q-imputation and Z-closure have complementary strengths and weaknesses. We therefore expect that Q-imputation will provide a useful additional method for computing supernetworks.


    Appendix
 Top
 Abstract
 Methods
 Results
 Discussion
 Appendix
 References
 
Here we will prove the theorem presented in Methods:

Theorem Suppose that T1,..., Tg is a collection of phylogenetic trees that are all subtrees of some phylogenetic tree T. If the Q-imputation algorithm in Figure 1 runs to completion on this collection without encountering a tie, then its output T1*,..., Tg* satisfies T1* = ... = Tg* = T.

We begin by making a useful observation which is straightforward to establish.

Observation If T' is a subtree of a phylogenetic tree T, and x isin L(T) – L(T'), then there is a unique position p of T' so that T'(x,p) is a subtree of T.

Before proving the theorem, we prove a key lemma, for which we need to introduce some notation. Suppose T is a phylogenetic tree and x is in L(T). We denote by Tx the tree resulting by removing the pendant edge labeled by x from T and, if necessary, suppressing the resulting degree two node. Furthermore, we denote by P(T) the set of positions of T.

LemmaSuppose T1 and T2 are subtrees of a phylogenetic tree T with |L(T1) {cap} L(T2)| ≥ 4. Let x be in L(T1) {cap} L(T2), and let p*isin P(T2x) denote the unique position of T2x into which x can be inserted so that the resulting phylogenetic tree is T. Let pisin P(T2x)–{p*} and denote by T3 the phylogenetic tree obtained by inserting x into T2x at position p. Then


Formula

Proof: Note that it suffices to show that


Formula

So, suppose qisin Qx(T3){cap} Qx(T1). Then, because T1 is a subtree of T, we have qisin Qx(T). By assumption, the leaf labels of q are contained in L(T3) = L(T2). Combined with the fact that T2 is a subtree of T, this implies qisin Qx(T2). Consequently, qisin Qx(T2){cap} Qx(T1), as required.

Proof of the Theorem: Without loss of generality, it suffices to show that T1* is a subtree of T.

Let t = |L(T)–L(T1)| and suppose that x1,x2,...,xt are added to T1 by the Q-imputation algorithm, in the given order, to give the tree T1*. Suppose also that that the tree T1k obtained by inserting x1,x2,...,xk, 1 ≤ k < t, into T1 is a subtree of T.

Let Formula isin P(T1k) be the unique position of T' = T1k chosen by the algorithm to insert xk+1, that is, the position Formula for which S(xk+1,Formula ) > S(xk+1,p) holds for all pisin P(T')–{Formula }. We claim that Formula is the unique position p*isin P(T') for which T'(xk+1,p*) is a subtree of T (which must exist by the above observation). In particular, if T'(xk+1,Formula ) is a subtree of T then, by induction, T1* is a subtree of T, which completes the proof.

Suppose for contradiction that T'(xk+1,Formula ) is not a subtree of T. Because p*isin P(T')–{Formula } we have S(xk+1,p*) < S(xk+1,Formula ), and so by definition


Formula

Now for all 2 ≤ j ≤ t put T1 = Tj, T2 = T'(xk+1,p*), and T3 = T'(xk+1,Formula ). Then applying the above lemma to T1, T2,T3 implies


Formula

a contradiction.


    Acknowledgements
 
We thank Jim Whitfield for supplying the trees for the microgastrine wasp case study and interpretation of the results and James Cotton and Martyn Kennedy for supplying all other example data sets.


    References
 Top
 Abstract
 Methods
 Results
 Discussion
 Appendix
 References
 

    Bandelt H. J., Dress A. W. Split decomposition: A new and useful approach to phylogenetic analysis of distance data. Mol. Phyl. Evol. (1992) 1:242–252.[CrossRef][Medline]

    Banks J. C., Whitfield J. B. Dissecting the ancient rapid radiation of microgastrine wasp genera using additional nuclear genes. Mol. Phyl. Evol. (2006) 41:690–703.[CrossRef][Web of Science][Medline]

    Baum B. R. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon. (1992) 41:3–10.[CrossRef][Web of Science]

    Bininda-Emonds O. R. P. Novel versus unsupported clades: Assessing the qualitative support for clades in MRP supertrees. Syst. Biol. (2003) 52:839–848.[Abstract/Free Full Text]

    Bininda-Emonds O. R. P. The evolution of supertrees. TREE (2004a) 19:315–322.[Medline]

    Bininda-Emonds O. R. P. Trees versus characters and the supertree/supermatrix "paradox." Syst. Biol. (2004b) 53:356–359.[CrossRef][Web of Science][Medline]

    Bininda-Emonds O. R. P., ed. Phylogenetic supertrees. Combining information to reveal the Tree of Life (2004c) Dordrecht, Boston, London: Kluwer Academic Publishers.

    Bininda-Emonds O. R. P., Bryant H. N. Properties of matrix representation with parsimony analysis. Syst. Biol. (1998) 47:497–508.[Web of Science][Medline]

    Bininda-Emonds O. R. P., Gittleman J. L., Purvis A. Building large trees by combining phylogenetic information: A complete phylogeny of the extant Carnivora (Mammalia). Biol. Rev. (1999) 74:143–175.

    Christiansen C., Mailund T., Pedersen C., Randers M. Computing the quartet distance between trees of arbitrary degree. In: WABI 2005 (2005) Berlin, Heidelberg, Budapest: Springer-Verlag. 77–88.

    Fitch W. M., Margoliash E. Construction of phylogenetic trees. Science (1967) 155:279–284.[Free Full Text]

    Gatesy J., Baker R. H., Hayashi C. Inconsistencies in arguments for the supertree approach: Supermatrices versus supertrees of Crocodylia. Syst. Biol. (2004) 53:342–355.[CrossRef][Web of Science][Medline]

    Gatesy J., Matthee C., Desalle R., Hayashi C. Resolution of a supertree/supermatrix paradox. Syst. Biol. (2002) 51:652–664.[Free Full Text]

    Holland B. R., Delsuc F., Moulton V. Visualizing conflicting evolutionary hypotheses in large collections of trees: Using consensus networks to study the origins of Placentals and Hexapods. Syst. Biol. (2005) 54:66–76.[Abstract/Free Full Text]

    Holland B. R., Jermiin L. S., Moulton V. Improved consensus network techniques for genome-scale phylogeny. Mol. Biol. Evol. (2006) 23:848–855.[Abstract/Free Full Text]

    Holland B., Moulton V. Consensus networks: A method for visualising incompatibilities in collections of trees. In: WABI 2003 (2003) Berlin, Heidelberg, Budapest: Springer-Verlag. 165–176.

    Huber K. T., Langton M., Penny D., Moulton V., Hendy M. Spectronet: A package for computing spectra and median networks. Appl. Bioinformatics (2002) 1:159–161.[Medline]

    Huber K. T., Moulton V. Phylogenetic networks. In: Mathematics of evolution and phylogeny—Gascuel O., ed. (2005) New York: Oxford University Press. 178–204.

    Huson D., Bryant D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. (2006) 23:254–267.[Abstract/Free Full Text]

    Huson D., Dezulian T., Klopper T., Steel M. Phylogenetic super-networks from partial trees. IEEE/ACM Trans. Comput. Biol. Bioinformatics (2004) 1:151–158.[CrossRef]

    Huson D. H., Steel M., Whitfield J. Reducing distorsion in phylogenetic networks. 6th Workshop on Algorithms in Bioinformatics—WABI 2006: ZurichSwitzerland.

    Kennedy M., Page R. D. M. Seabird supertrees: Combining partial estimates of Procellariform phylogeny. Auk (2002) 199:88–108.

    Linder C. R., Rieseberg L. Reconstructing patterns of reticulate evolution in plants. Am. J. Bot. (2004) 91:1700–1708.[Abstract/Free Full Text]

    Little R. J. A., Rubin D. B. Statistical analysis with missing data (1987) New York: Wiley.

    Makarenkov V., Lapointe F. J. A weighted least-squares approach for inferring phylogenies from incomplete distance matrices. Bioinformatics (2004) 20:2113–2121.[Abstract/Free Full Text]

    Posada D., Crandall K. A. Intraspecific phylogenetics: Trees grafting into networks. Trends Ecol. Evol. (2001) 16:37–45.[CrossRef][Medline]

    Ragan M. A. Phylogenetic inference based on matrix representation of trees. Mol. Phyl. Evol. (1992) 1:53–58.[CrossRef][Medline]

    Steel M., Semple C. Phylogenetics (2003) New York: Oxford University Press.

    Swofford D. L. PAUP*. Phylogenetic analysis using parsimony (*and other methods), version 4 (1998) Sunderland, Massachusetts: Sinauer Associates.

    Swofford D. L., Olson G. J., Waddell P. J., Hillis D. M. Phylogenetic Inference. In: Molecular systematics—Hillis D. M., Moritz C., Mable B. K., eds. (1996) Sunderland, Massachusetts: Sinauer Associates. 407–514.

    Wilkinson M., Cotton J., Creevey C. J., Eulenstein O., Harris S. R., Lapointe F. J., McInerney J. O., Pisani D., Thorley J. The shape of supertrees to come: Input tree shape biases and some axiomatic properties of fourteen supertree methods. Syst. Biol. (2005a) 54:419–431.[Abstract/Free Full Text]

    Wilkinson M., Pisani D., Cotton J. A., Corfe I. Measuring support and finding unsupported relationships in supertrees. Syst. Biol. (2005b) 54:823–831.[Free Full Text]

    Yule G. U. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis. Philos. Trans. R. Soc. Lond. B (1924) 213:21.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
J. B. Whitfield, S. A. Cameron, D. H. Huson, and M. A. Steel
Filtered Z-Closure Supernetworks for Extracting and Visualizing Recurrent Signal from Incongruent Gene Trees
Syst Biol, December 1, 2008; 57(6): 939 - 947.
[Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Holland, B.
Right arrow Articles by Moulton, V.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Holland, B.
Right arrow Articles by Moulton, V.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?