Skip Navigation

Systematic Biology 2007 56(2):345-355; doi:10.1080/10635150701286549
This Article
Right arrow Extract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gauthier, O.
Right arrow Articles by Lapointe, F.-J.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Gauthier, O.
Right arrow Articles by Lapointe, F.-J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 Society of Systematic Biologists

Seeing the Trees for the Network: Consensus, Information Content, and Superphylogenies

Edited by Allan Baker: Associate Editor

Olivier Gauthier and François-Joseph Lapointe

1 Département de sciences biologiques, Université de Montréal C.P. 6128, Succursale Centre-Ville, Montréal (Québec) H3C 3J7, Canada E-mail: olivier.gauthier{at}umontreal.ca francois-joseph.lapointe{at}umontreal.ca

Received May 9, 2006; Revised July 7, 2006; Accepted October 15, 2006 Phylogenetic inference often leads to solutions made up of multiple trees on a given set of leaves or taxa. These competing hypotheses might be the equally optimal trees obtained from the analysis of a single matrix using maximum parsimony (Hennig, 1979) or maximum likelihood (Felsenstein, 1981) or the set of most probable trees produced by Bayesian analysis (Rannala and Yang, 1996; Larget and Simon, 1999; Mau et al., 1999). They might also have been inferred independently from different data sets or even be the result of resampling methods such as the bootstrap or the jackknife (Felsenstein, 1985; Penny and Hendy, 1985, 1986). Consensus methods are commonly used to identify areas of conflict and agreement among such multiple trees; they can represent the relationships that are supported either unanimously or by a majority of trees while discarding other, less supported relationships (Bryant, 2003). Thus, a consensus method is usually defined as a function that takes as input a set, or profile, of trees on the same set of taxa and returns a single tree on the same set of taxa (Leclerc and Cucumel, 1987; Steel et al., 2000; Bryant, 2003). This function can take different forms, and many consensus methods have been proposed (see Swofford, 1991; Bryant, 2003, for reviews), amongst which the strict consensus (Sokal and Rohlf, 1981) and majority-rule consensus (Margush and McMorris, 1981) are perhaps the most widely used and understood. These various functions differ in two major respects: (1) the kind of information they preserve and (2) the way they deal with conflict among input trees. Indeed, the information contained in a tree can be considered in terms of nesting, three- or four-taxon statements, components, and branch lengths, while conflict can be left unresolved or dealt with using different criteria (Bryant, 2003). Notwithstanding these differences, most methods abide to the prevailing phylogenetic model: that of a tree embedding a hierarchy of descent. This model puts forward a fully resolved, or binary, tree as the ideal representation of evolution. In practice, consensus trees are seldom binary and they embed—i.e., are compatible with—multiple binary trees. Indeed, because it is often impossible to distinguish between hard and soft polytomies (Nelson and Platnick, 1980; Maddison, 1989) and because the latter can be resolved in a number of different ways, an unresolved tree can be refined by a set of binary trees (Rohlf, 1982; Mickevich and Platnick, 1989; Steel et al., 2000).

To circumvent this practical problem, a consensus method could be defined as a function that takes as input a profile of trees on the same set of taxa and returns one or multiple binary trees on the same set of taxa. Interestingly, on top of accounting for the fact that consensus trees are often unresolved, this definition allows for the formulation of consensus methods that produce multiple trees (Wilkinson, 1994; Lapointe and Cucumel, 2002) and networks (Bandelt, 1995; Holland and Moulton, 2003; Lapointe and Cucumel, 2003). This is desirable because incongruence among phylogenetic trees often leads to poorly resolved consensus solutions, and using multiple trees or networks can improve the representation of shared information among the fundamentals. Also, the treeness of phylogenies has been questioned and the so-called Tree of Life most probably ranges in complexity from that of a "simple" tree, to that of an entangled network, or even an inscrutable web (Doolitle, 1999; Daubin et al., 2002; Rivera and Lake, 2004).

The diversity of consensus methods already available (Bryant, 2003) makes it necessary to choose among them or among the solutions they produce. This decision can be based on axiomatic properties of the competing approaches. This is the position championed by Page (1992) for whom "choosing a consensus method requires at least two decisions: (1) what aspect of tree structure is considered important; and (2) what level of agreement between trees is to be reflected in the consensus tree." From this standpoint, it can be argued that methods that are Pareto and co-Pareto are preferable over those that are not. Indeed, a method that is Pareto will produce solutions that include statements made by all input trees. Likewise, a method that is co-Pareto will display relationships that are present in at least one input tree. Both these properties are desirable when a consensus is perceived as a means to summarize the statements made by the input trees. However, the choice of a consensus solution could also be based on the amount of phylogenetic information it conveys. How many statements does it make about the taxa under consideration? Are these statements unambiguous? How many taxa are found in the consensus? How many statements that could be made about the taxa are disallowed by the consensus? How many binary trees, out of all the possible trees with the same number of leaves as the consensus, are compatible with the consensus? All these questions reflect what biologists consider relevant in a phylogeny: grouping. They have to be answered in order to measure the phylogenetic information content of a consensus and, for this to be practical, an objective measure of information such as the cladistic information content (CIC) is needed (Thorley et al., 1998; Thorley, 2000). Evidently, the CIC of solutions produced by a given consensus method depends both on the input trees and the procedure itself, and ultimately we might seek a balance between these two criteria.

In this paper, we discuss different ways to produce consensus solutions containing more phylogenetic information in the form of consensus networks (CN). We introduce a new method for constructing CN complementing those of Bandelt (1995), Holland and Moulton (2003), and Lapointe and Cucumel (2003) and present a general framework to obtain a CN from any input profile of weighted or unweighted trees. Furthermore, the CIC, a measure of phylogenetic information content originally devised for strict consensus trees (Thorley et al., 1998), is extended here to measure the information content of consensus networks. Finally, this measure and some properties of CN are illustrated with an application to the phylogeny of mammalian orders.


    Computing a Consensus Network
 Top
 Computing a Consensus Network
 The Information Content of...
 Application
 Discussion
 Acknowledgment
 References
 
In the spirit of the many flavors of matrix representation (MR) in the consensus setting, a general approach for constructing a CN of a set of k trees defined on the same set of leaves S involves two steps (see Lapointe et al., 2003; Wilkinson et al., 2005): (1) obtain an MR from the input profile and (2) compute a network from this MR. Both steps can be achieved in a number of ways (e.g., Holland and Moulton, 2003; Lapointe et al., 2003; Holland et al., 2004, 2005). The approach presented here can be used with rooted and unrooted, as well as weighted and unweighted trees. While a number of other metrics could be used, at least two different nx n distance matrices d can represent a tree T defined on the set S = {1,...,n} (Fig. 1). Discarding branch lengths and thus focusing only on topological information, the branch distance between two taxa is obtained by counting the number of branches separating them (Zaretskii, 1965). Accounting for branch lengths, the path-length distance between two leaves is the sum of the weight of the branches between them (Buneman, 1971). Computing branch- or path-length distance matrices for every tree, the input profile of trees is recoded in a series of matrix representations with distances (MRD) that must be combined in a single matrix prior to network reconstruction.


Figure 1
View larger version (82K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 A matrix representation (MR) of a tree can take the form of a binary character matrix (left) or a distance matrix (right). Full lines indicate correspondence between trees and MR as well as a combination of MR from different trees in a unique MR. Dashed lines indicate equivalence between two MR. Each split corresponds to a binary character for which the taxa on either side of the split are coded 1 and 0, respectively. Informative characters (partitioning the taxa in two groups of two) are numbered 1 to 3, whereas uninformative characters (partitioning the taxa in a group of three and a singleton) are only shown for the MR of single trees. Binary MR for two (or more) trees are obtained by combining the character matrices including the uninformative characters. Distance MR are obtained by summing the lengths of branches between pairs of taxa (here all equal 1); average distance MR for two (or more) trees contain the average path length distances between pairs of taxa.

 
Alternatively, the k input trees could be coded as binary characters in a pseudo-character matrix as for matrix representation with parsimony (MRP; Baum, 1992; Ragan, 1992; Purvis, 1995). This is the approach proposed by Holland and Moulton (2003) for the construction of CN. Because we are dealing with unrooted trees, we focus on bipartitions, or splits, of S rather than components. A split defines two non-empty subsets S' and S'' of S such that S' {cup} S'' = S and S' {cap} S'' = Formula . A tree is composed of a set of compatible splits and each split in the input trees gives rises to a binary character for which taxa in S' and S" are assigned the values 0 and 1, respectively.

We propose to use the average path-length distance matrix in which the distance between leaves A and B is the arithmetic mean of the distance between A and B computed over the k matrix representations as the MR of the input profile of trees. It should be noted that the median or another parameter could be used as well (Lapointe and Cucumel, 1997). Also, computing a Hamming distance (or single character difference) on the MR of splits and dividing by k would yield the mean branch distance matrix (Lapointe et al., 2003). At this point it should be clear that not only a collection of trees but also a collection of networks, or simply splits, whether they are compatible or not, could be combined in this way.

The second step towards the construction of a CN, inferring the network from the MR, necessitates the choice of an adequate method. Recent years have seen the appearance of numerous techniques and software to infer phylogenetic networks and detect reticulate taxa (e.g., Fitch, 1997; Lapointe, 2000; Strimmer and Moulton, 2000; Xu, 2000; Posada and Crandall, 2001; Strimmer et al., 2001; Gusfield et al., 2003; Gusfield and Bansal, 2005; Huson and Kloepper, 2005; Huson et al., 2005;Nakhleh et al., 2004). The motivation for these developments was either to graphically illustrate conflicting signals in a given data set (e.g., Bandelt and Dress, 1992); to elucidate reticulate relationships between taxa brought about by recombination, lateral gene transfer, or hybridization events (e.g., Xu, 2000; Legendre and Makarenkov, 2002; Gusfield et al., 2003; Huson and Kloepper, 2005); or to produce a single graphical representation of the set of maximum parsimony trees directly from the data set (Bandelt et al., 1995; Fitch, 1997).

We propose to use the split decomposition method of Bandelt and Dress (1992) on the average MRD of the input trees (but see Holland and Moulton, 2003, for a similar application using bipartitions and median networks). Split decomposition is a distance-based method that was developed to identify and represent conflicting signals in any phylogenetic data set (Bandelt and Dress, 1992). It considers in turn all the quartets of taxa in the data set with regards to the four-point condition and the three distinguishable unrooted trees involving four objects (AB| CD, AC| BD, and AD| BC). From these three, it then rejects the most violating split and, unless the data are perfectly tree-like, keeps the remaining two that are termed weakly compatible and can be represented by a planar circular split network but not by a tree (Bandelt and Dress, 1992). Differential support, in the form of an isolation index, is attributed to both bipartitions (see also Winkworth et al., 2005). The full representation on all quartets of the set of weakly and fully (i.e., tree-like) compatible splits is a usually planar, or mildly nonplanar, circular split network (also called a splits graph; Dress and Huson, 2004). In other words, the split decomposition graph will only display splits for which the induced four-taxon statements are never the least supported of the three possible four-taxon statements on those taxa (for example, the split AB| CDE is displayed only if AB| CD, AB| CE, and AB| DE are not rejected). When given a set of fully compatible splits the split decomposition network is a tree. Computing a split decomposition network on the MRD of an input profile of trees gives a split decomposition CN (or more generally a consensus network). Interestingly, a split decomposition CN of any two input trees will always display both these input trees (Bandelt and Dress, 1992). However, the split decomposition CN of three or more trees does not always satisfy this property because it then presents an approximation of the splits in the input trees (Bryant et al., 2003).

Consider the three unrooted phylogenies for four taxa depicted in Figure 2a. The strict consensus of this profile contains all and only those bipartitions that are present in all trees and is thus completely unresolved (Fig. 2b). The majority-rule consensus would also be a star tree and the same goes for the strict or majority-rule consensus of any two of these trees. On the other hand, a split decomposition CN of any pair of these trees combines both weakly compatible splits in a single representation (Fig. 2c). Thus, in a split decomposition CN, it is possible to retain two weakly compatible splits rather than throwing away valuable information. For example, considering the two topmost trees in Figure 2a, both splits AB| CD and AC| BD are supported by the data; split AD| BC is not, however. The split decomposition CN embeds these weakly compatible signals (Fig. 2c), whereas a strict consensus concludes to an absence of shared information (Fig. 2b). An unconstrained median CN of the three trees in Figure 1a will always display all three splits at once (Fig. 2d). Just like the star tree (Fig. 2b), it is compatible with all of them. In the special case where all branch lengths are equal, or when using branch distance, the split decomposition CN of the MRD of the three trees in Figure 2a is also a star tree (Wilkinson et al., 2003). Interestingly, with the median CN of Holland and Moulton (2003), it is possible to set a support threshold so that splits that do not appear in at least a specified number of trees are not displayed in the consensus solution: a more restrictive threshold will reduce the dimensionality of the solution and potentially increase its information content.


Figure 2
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 (a) The three possible unrooted trees with four taxa; (b) the strict and majority rule consensus tree of any combination of the trees in (a), it is also the split decomposition CN of the three trees in (a) when all branch lengths are equal; (c) the split decomposition CN of any pair of the trees in (a); and (d) the unconstrained median CN of all three trees in (a). The solutions in (b) and (d) are both compatible with all the trees in (a) and have a CICrel = 0.00.

 

    The Information Content of Split Networks
 Top
 Computing a Consensus Network
 The Information Content of...
 Application
 Discussion
 Acknowledgment
 References
 
Recent papers focusing on the use of networks in phylogenetics have pointed out that network methods were desirable because they often convey more phylogenetic information than trees (Bryant et al., 2003; Wilkinson et al., 2003; Holland et al., 2005; Winkworth et al., 2005; Huson and Bryant, 2006). They all fail, however, to provide an objective and efficient way to quantify the information content of networks and to compare it to that of trees. Although the notion of how much information a tree contains can seem rather intuitive, formal definitions are still needed; numerous information indices have been proposed in the past, but most are flawed (Page, 1992; Thorley, 2000). Although the information contained in branch lengths is vital to phylogenetic reconstruction, the amount of information conveyed by a phylogeny is usually understood in terms of grouping because the phyletic relationships among species are defined only by cladogenesis. An appropriate information measure should tell us something about the uncertainty placed on the existence and composition of such groups. Namely, a star tree is compatible with all possible groupings of the leaves, and thus maximally uncertain; it contains no cladistic information. A fully resolved tree, on the other hand, leaves no uncertainty as to the way the taxa are grouped; it is thus maximally informative. Similarly, an unresolved tree, or a network, will exhibit varying degrees of uncertainty between these two extremes, and should be attributed intermediate information values. Moreover, although parameters such as tree shape and labeling should not influence a measure of the information content of phylogenies, tree size should be taken into account. Thus, a fully resolved phylogeny on 50 taxa should come out as more informative than a fully resolved tree defined on 20 taxa. Thorley et al. (1998) have proposed a measure of the information content of phylogenetic trees that fits these requirements and takes its roots in information theory, in the form of their cladistic information content (CIC).

The information conveyed by an observation (I) depends both on the number of equally probable possibilities before the observation (P0) and the number of equally probable possibilities after the observation (P1; Brillouin, 1962):


Formula 1

(1)

The CIC of a tree T on a set of leaves S = {1,2,...,n} is thus defined as being inversely proportional to the ratio of the number of binary trees permitted by T (nR) to the number of possible binary trees for n taxa (nT; Thorley et al., 1998):


Formula 2

(2)
where, in the case of unrooted trees (Edwards and Cavalli-Sforza, 1964):


Formula 3

(3)
where di is the degree of node i in the set of nodes V(T) from tree T and:


Formula 4

(4)

In order to compute the CIC of a network, one has to enumerate the multiple trees embedded in the network. Just like each of the networks in Figure 2c displays two of the three trees in Figure 2a, a network can display two or more trees and can be considered as a collection of trees (Huson and Bryant, 2006). Information is said to be redundant when more bits are used to transmit a message than the number of actual bits contained in the message itself (Brillouin, 1962). In the present case, redundancy would follow from counting the same tree more than once in a given network; for example, by counting an unresolved tree and one or more trees that refine it. In order to eliminate any redundancy, the trees contained in the network are obtained by extracting all the largest sets of compatible splits from the set of weakly compatible, or incompatible, splits that make up the network with a brute force approach; this yields only the maximally resolved (although not necessarily binary) trees embedded in the network. For example, the network with splits AB|CDE, AE|BCD, and BC|ADE (Fig. 3a) contains two maximally resolved trees. One of these (Fig. 3b) has three possible resolutions (Fig. 3d); the other is binary (Fig. 3c). Median networks and split decomposition networks belong to the family of "split networks," which are "combinatorial generalization of phylogenetic trees" and can be fully represented by a split table, a table listing the splits in the splits network and their respective weight (Huson and Bryant, 2006). From the split table, build a compatibility matrix among splits while ignoring trivial splits (Jakobsen and Easteal, 1996). Pick a split, add to it a compatible split, and add to these a split that is compatible with both splits. Continue adding splits that are compatible to all splits in the set until no more can be found. Sequentially assemble every set of fully compatible splits by applying this procedure to the entire compatibility matrix. Our naïve implementation of this procedure (available upon request) reads a split table and outputs sets of binary characters corresponding to the maximally resolved trees extracted from the network. These can be used to recover the network or the individual trees. At the moment, this approach is computationally intensive and it might limit the applicability of the CIC to smaller data sets or less complex networks. Alternatively, compatibility matrices can be constructed from binary characters in Spectronet (Huber et al., 2002), and split networks can be graphically manipulated in both Spectronet and SplitsTree4 (Huson and Bryant, 2006). The CIC of the network is then computed as that of a tree, by summing the nR over all maximally resolved trees contained in the network. This is equivalent to calculating the combined CIC of multiple trees that do not contain redundant information (Thorley, 2000). The CIC of the network (N) is then:


Formula 5

(5)


Figure 3
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 (a) The network with splits AB|CDE, AE|BCD, and BC|ADE contains two maximally resolved trees (b and c). Tree (b) has three resolutions, whereas tree (c) is binary. The network thus allows four distinct fully resolved trees and as a CICrel = 0.49.

 
Thus, the CIC of a split decomposition CN, or more generally a split network, is the sum of the CIC of all the maximally resolved trees embedded in this graph. Note that if N is a tree, then Equation 4 is equivalent to Equation 2 and we get back to the original CIC (Thorley et al., 1998). The maximum CIC value, CICmax = –log 1/nT, depends only on n. To render the comparison of different CIC values easier, the relative CIC is computed as:


Formula 6

(6)

For example, the network in Figure 3a allows four fully resolved trees out of the 25 possible unrooted binary trees for five taxa and has a CICrel = 0.49. Although the relative measure has no units, information is usually expressed either in bits or in nats, depending on the base of the logarithm, 2 or e, respectively. If T is a binary tree, then nR equals one; otherwise, nR equals the number of binary trees that are compatible with T; i.e., the number of binary trees that refine T. The CIC and the resolution of individual trees can be computed within RadCon (Thorley and Page, 2000).


    Application
 Top
 Computing a Consensus Network
 The Information Content of...
 Application
 Discussion
 Acknowledgment
 References
 
In order to illustrate the use of CN and CIC, we constructed different consensus of the eight mammalian gene trees presented in Figure 4 (data from Springer et al., 1999; available at http://hydrodictyon.eeb.uconn.edu/systbiol/issues/48_1/Springer). The strict consensus (not shown) is completely unresolved and has a corresponding CIC of 0. The majority-rule consensus offers some resolution (Fig. 5a); it can be refined by 945 different unrooted binary trees and has a CICrel of 0.61. The CIC of the consensus is improved by using a CN. The median CN containing the splits included in at least two trees (Fig. 5b) has a CICrel of 0.66 (396 trees), whereas the split decomposition CN (Fig. 5c) has a CICrel of 0.74 (90 trees). If only the splits included in at least three trees are accounted for, the CICrel of the median CN is 0.80 (30 trees; Fig. 5d). The unrestricted median CN (including all the splits in the input trees) contained little more information than a star tree and exhibited the highly incompatible signals in the input trees.


Figure 4
View larger version (103K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 Phylogenetic trees for 11 mammalian species obtained from eight mitochondrial and nuclear genes (data from Springer et al., 1999).

 


Figure 5
View larger version (69K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5 Different consensus representations of the profile of trees in Figure 4. (a) The majority rule consensus (CICrel = 0.61); (b) the median CN restricted to splits contained in at least two trees (CICrel = 0.66); (c) the split decomposition CN (CICrel = 0.74); (d) the median CN restricted to splits contained in at least three trees (CICrel = 0.80).

 

    Discussion
 Top
 Computing a Consensus Network
 The Information Content of...
 Application
 Discussion
 Acknowledgment
 References
 
Many information measures have been proposed for consensus trees (e.g., Mickevich and Farris, 1981; Mickevich and Platnick, 1989) but, as reviewed by Thorley (2000), they have many shortcomings. More importantly, most are affected by tree shape, such that pectinate trees are considered more informative than balanced trees. The CIC, on the other hand, is shapeless and takes into account the relative position and size of the components (Thorley, 2000). It is an appropriate measure of the information content of phylogenies. This measure can thus help researchers choose among competing consensus methods and solutions: the most informative ones should be preferred as long as the information they convey is relevant. Relevancy is dependent on the axiomatic properties of the methods, such that quantity of information should not be equated with its quality. Indeed, a consensus method that returns a random binary tree would score maximally with a measure such as the CIC: it apparently provides a very informative solution. Nonetheless, it contains no relevant phylogenetic information because it preserves only the names of the taxa in the input trees. The information contained in a split decomposition CN is relevant because it consists only of the primary and secondary signals in the input profile of trees.

When T is a consensus tree, Thorley et al. (1998) state that the CIC can only be used if T is strict (sensuWilkinson, 1994); i.e., if T contains only relationships that are present in all the input trees. For example, the strict consensus tree is strict for components because it contains only components present in all input trees, whereas the Adams consensus is strict on nesting because it contains only nestings present in all input trees (Adams, 1972). Hence, they consider a consensus as being only a summary or representation of the source trees and equate the trees counted by nR not with the binary trees that are compatible with T but with the trees that could have been represented among the input trees. Following this definition, majority-rule consensus trees, all MR consensus trees, as well as the consensus networks presented here, contain no phylogenetic information because it is impossible "to deduce from the consensus tree" or network "alone which of the possible trees could not have been represented among the fundamentals" (Thorley et al., 1998). The problem we have with this assertion is that it concludes that most of the phylogenetic supertrees published so far do not contain any phylogenetic information. Indeed, supertree methods are generalizations of consensus methods for input trees defined on partially overlapping sets of taxa; the vast majority of supertree methods are not strict (but see the strict component supertree of Bryant, 2002).

We see no reason not to consider a consensus method as a meta-analytical tool. Consequently, it is not required to be strict in order for the above measure of phylogenetic information to be applicable. This allows considering as phylogenetically informative the solutions of consensus methods that preserve only the most supported relationships (e.g., majority-rule consensus), strip the least supported ones (e.g., split decomposition CN or constrained median CN), or optimize a given criterion (e.g., average consensus of Lapointe and Cucumel, 1997). Moreover, it allows for the measure of the information content of the numerous supertrees published so far.

Although Thorley et al. (1998) did not consider branch lengths in their CIC, this important element could be taken into account in a weighted version of the information measure of CN. This might prove difficult, because we must not only deal with the relative order of internal nodes as in the dendritic information content (Thorley, 2000) but also with the relative support of competing hypotheses. Although such developments are beyond the scope of this paper, it should be pointed out that, in this context, a multidimensional CN (using the method of Holland and Moulton, 2003, for example) might be more informative than a planar CN. However, a high dimensionality of the solution will still leave it overwhelmingly difficult to interpret and most probably contain much noise.

However, when considering only the topology of the consensus solution to calculate its CIC, a CN using split decomposition will usually contain more information than one using median networks. Indeed, as Cassens et al. (2005) rightfully noted, "a graph with maximum compatibility (i.e., [...] a complete graph [...] where each node is connected to all others) has a 100% compatibility [...] yet it is of little value since it conveys no genealogical information." The CN based on split decomposition will return a star tree only when the three topologies are equally supported (Wilkinson et al., 2003). When the trees are differentially supported, the least supported topology is eliminated, thus increasing the phylogenetic information content of the CN. Hence, unless it is constrained to low dimensionality, a median CN can carry little information while being hard to read. The support threshold introduced by Holland and Moulton (2003) allows for the median CN to range in complexity from an unconstrained high dimensional network displaying all the splits in the input trees to a solution displaying only the splits present in all the inputs trees, effectively a strict consensus. The CIC constitutes an objective measure for setting this threshold value, as the preferred solution could be the one that maximizes the CIC.

One way to further maximize the information content of consensus solutions without any more relaxation of the treeness criterion would be to use reduced CN profiles. Reduced consensus methods are more sensible than standard consensus to information shared among competing trees (Wilkinson, 1994, 1995, 1996). Although the formal definition of a reduce CN is beyond the scope of this paper, a surrogate to reduce CN would be to sequentially remove problematic taxa from the matrix representation before computing the solution. To convey more information than a conventional consensus tree, a CN should not contain complex multidimensional sets of incompatible splits because these cannot be more informative than polytomies. CN can thus be used to increase the CIC of consensus solutions, not with arbitrary conflict resolution or by displaying all the input trees, but by weeding out less supported relationships.

Other potential applications of the CIC of networks include the construction of split networks directly from the data as well as the statistical description of networks constructed from confidence sets, or credibility sets, of trees. Indeed, networks can become extremely complex when the data contains many incompatible signals. The CIC could be used as a way to control the complexity of these networks by allowing only those splits for which the improvement in fit does not lead to an unjustified loss of CIC. Moreover, in the framework we have put forth, a split network is considered as a collections of trees—i.e., those trees that are compatible with the network; this could have a concrete statistical meaning when applied to networks constructed from confidence sets, or credibility sets, of trees, as it could allow for the comparison of different such sets.

A current concern in phylogenetics is our ability to construct large phylogenies using supermatrix or supertree approaches (Sanderson et al., 1998; Bininda-Emonds et al., 2002; Wilkinson et al., 2005). Although we are not concerned here with supermatrices or the debate on whether we should combine data or trees, we would like to point out that the framework we have outlined here for inferring CN is easily extendable to the construction of super-networks, at least in theory. This would constitute an alternative to the Z-closure algorithm (Huson et al., 2005) implemented in SplitsTree4 (Huson and Bryant, 2006). However, combining the MR of trees (or networks) defined on partially overlapping sets of taxa leads to a number of new problems. In the first place, it generates missing data in the MR of the input trees; pseudo-characters are coded as missing for all the taxa not present in a given tree, whereas no distance exists for pairs of taxa that are never present in the same tree. Missing distances can be estimated using tested and accurate methods (Levasseur et al., 2003; Makarenkov and Lapointe, 2004), and most network construction algorithms can accommodate missing values in a character matrix; for example, the distances required for split decomposition can be computed on a data matrix with missing cells. However, the impact of these new elements, missing data and the recourse to estimated distances, has not been assessed in the context of phylogenetic network inference and should be addressed before we can confidently recommend the construction of such supernetworks and measure their information content.


    Acknowledgment
 Top
 Computing a Consensus Network
 The Information Content of...
 Application
 Discussion
 Acknowledgment
 References
 
The authors are grateful to Allan Baker, Denis Barabé, Luc Brouillet, David Bryant, Barbara Holland, Roderic Page, and all members of the LEMEE for constructive comments on an earlier version of this manuscript. This work was supported by NSERC and FQRNT scholarships to OG, NSERC grant no. OGP0155251 and FQRNT grant no. PR88559 to FJL. Thanks are extended to Mark Springer for making available the data used in the application.


    References
 Top
 Computing a Consensus Network
 The Information Content of...
 Application
 Discussion
 Acknowledgment
 References
 

    Adams E. N. III. Consensus techniques and the comparison of taxonomic trees. Syst. Zool. (1972) 21:390–397.[Abstract/Free Full Text]

    Bandelt H.-J. Combination of data in phylogenetic analysis. Plant Syst. Evol. Suppl. (1995) 9:355–361.

    Bandelt H.-J., Dress A. W. M. Split decomposition: A new and useful approach to phylognetic analysis of distance data. Mol. Phylogenet. Evol. (1992) 1:242–252.[CrossRef][Medline]

    Bandelt H.-J., Forster P., Sykes B. C., Richards M. B. Mitochondrial portraits of human populations using median networks. Genetics (1995) 141:743–753.[Abstract]

    Baum B. R. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon (1992) 41:3–10.[CrossRef][Web of Science]

    Bininda-Emonds O. R. P., Gittleman J. L., Steel M. A. The (super)tree of life: Procedures, problems, and prospects. Annu. Rev. Ecol. Syst. (2002) 33:265–289.[CrossRef][Web of Science]

    Brillouin L. Science and information theory (1962) 2nd English edition. New York: Academic Press.

    Bryant D. Strict consensus supertrees (2002) Montreal, Canada: School of Computer Science, McGill University. Technical report.

    Bryant D. A classification of consensus methods for phylogenetics. In: Bioconsensus—Janowitz M., Lapointe F.-J., McMorris F., Mirkin B., Roberts F. B., eds. (2003) Providence, Rhode Island: American Mathematical Society. 163–184.

    Bryant D., Huson D. H., Kloepper T., Nieselt-Struwe K. Distance corrections on recombinant sequences. In: Algorithms in bioinformatics: Third International Workshop, WABI 2003—Benson G., Page R. D. M., eds. (2003) Berlin/Heidelberg, Germany: Springer. 271–286. Lecture Notes in Computer Science 2812.

    Buneman P. The recovery of trees from measures of dissimilarity. In: Mathematics in the archaeological and historical sciences—Hodson F. R., Kendall D. G., Tautu P., eds. (1971) Edinburgh, England: Edinburgh University Press. 387–395.

    Cassens I., Mardulyn P., Milinkovitch M. C. Evaluating intraspecific "network" construction methods using simulated sequence data: Do existing algorithms outperform the global maximum parsimony approach? Syst. Biol. (2005) 54:363–372.

    Daubin V., Gouy M., Perriere G. A phylogenomic approach to bacterial phylogeny: Evidence of a core of genes sharing a common history. Genome Res. (2002) 12:1080–1090.[Abstract/Free Full Text]

    Doolittle W. F. Phylogenetic classification and the universal tree. Science (1999) 284:2124–2128.[Abstract/Free Full Text]

    Dress A. W. M., Huson D. H. Constructing splits graphs. IEEE/ACM Trans. Comput. Biol. Bioinform. (2004) 1(3):109–115.[CrossRef]

    Edwards A. W. F., Cavalli-Sforza L. L. Reconstruction of evolutionary trees. In: Phenetic and phylogenetic classification—Heywood W. H., McNeill J., eds. (1964) London, England. 67–76. Systematics Association Publication No. 6.

    Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. (1981) 16:368–376.

    Felsenstein J. Confidence limits on phylogenies: An approach using the bootstrap. Evolution (1985) 39:783–791.[CrossRef][Web of Science]

    Fitch W. M. Networks and viral evolution. J. Mol. Evol. (1997) 44:S65–S75.[CrossRef][Web of Science][Medline]

    Gusfield D., Bansal V. A fundamental decomposition theory for phylogenetic networks and incompatible characters. Miyano S., Mesirov J. P., Kasif S., Istrail S., Pevzner P. A., Waterman M. S., eds. (2005) Proceedings of the Ninth Annual International Conference on Research in Computational Molecular Biology, RECOMB 2005. Berlin/Heidelberg, Germany: Springer. 217–232.

    Gusfield D., Eddhu S., Langley C. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J. Bioinform. Comput. Biol. (2003) 2:173–213.[CrossRef]

    Hennig W. Phylogenetic systematics (1979) Urbana, Illinois: University of Illinois Press.

    Holland B. R., Delsuc F., Moulton V. Visualizing conflicting evolutionary hypotheses in large collections of trees: Using consensus networks to study the origins of placentals and hexapods. Syst. Biol. (2005) 54:66–76.[Abstract/Free Full Text]

    Holland B. R., Huber K. T., Moulton V., Lockhart P. J. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. (2004) 21:1459–1461.[Abstract/Free Full Text]

    Holland B. R., Moulton V. Consensus networks: a method for visualising incompatibilities in a collection of trees. In: Algorithms in bioinformatics: Third International Workshop, WABI 2003—Benson G., Page R. D.M, eds. (2003) Berlin/Heidelberg, Germany: Springer. 165–176. Lecture Notes in Computer Science 2812.

    Huber K. T., Langton M., Penny D., Moulton V., Hendy M. Spectronet: A package for computing spectra and median networks. Appl. Bioinformat. (2002) 1:159–161.

    Huson D. H., Bryant D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. (2006) 23:254–267.[Abstract/Free Full Text]

    Huson D. H., Kloepper T. H. Computing recombination networks from binary sequences. Bioinformatics (2005) 21(Suppl. 2):159–165.[CrossRef][Web of Science]

    Huson D. H., Kloepper T., Lockhart P. J., Steel M. A. Reconstruction of reticulate networks from gene trees. Miyano S., Mesirov J. P., Kasif S., Istrail S., Pevzner P. A., Waterman M. S., eds. (2005) Proceedings of the Ninth Annual International Conference on Research in Computational Molecular Biology, RECOMB 2005. Berlin/Heidelberg, Germany: Springer. 233–249.

    Jakobsen I. B., Easteal S. A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences. CABIOS (1996) 12:291–295.[Medline]

    Lapointe F.-J. How to account for reticulation events in phylogenetic analysis: A comparison of distance based methods. J. Classif. (2000) 17:175–184.[CrossRef]

    Lapointe F.-J., Cucumel G. The average consensus procedure: Combination of weighted trees containing identical or overlapping sets of objects. Syst. Biol. (1997) 46:306–312.[Abstract/Free Full Text]

    Lapointe F.-J., Cucumel G. Multiple Consensus Trees. In: Classification, clustering, and data analysis—Jajuga K., Sokolowski A., Bock H.-H., eds. (2002) Berlin, Germany: Springer. 359–364.

    Lapointe F.-J., Cucumel G. How good can a consensus get? Assessing the reliability of consensus trees in phylogenetic studies. In: Bioconsensus—Janowitz M., Lapointe F.-J., McMorris F., Mirkin B., Roberts F. B., eds. (2003) Providence, Rhode Island: American Mathematical Society. 205–219.

    Lapointe F.-J., Wilkinson M., Bryant D. Matrix representations with parsimony or with distances: Two sides of the same coin? Syst. Biol. (2003) 52:865–868.

    Larget B., Simon D. L. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. (1999) 16:750–759.[Web of Science]

    Leclerc B., Cucumel G. Consensus en classification: Une revue bibliographique. Mathématique en Sciences Humaines (1987) 100:109–128.

    Legendre P., Makarenkov V. Reconstruction of biogeographic and evolutionary networks using reticulograms. Syst. Biol. (2002) 51:199–216.[Abstract/Free Full Text]

    Levasseur C., Landry P. A., Makarenkov V., Kirsch J. A. W., Lapointe F.-J. Incomplete distance matrices, supertrees and bat phylogeny. Mol. Phylogenet. Evol. (2003) 27:239–246.[CrossRef][Web of Science][Medline]

    Maddison W. Reconstructing character evolution on polytomous cladograms. Cladistics (1989) 5:365–377.[CrossRef][Web of Science]

    Makarenkov V., Lapointe F.-J. A weighted least-squares approach for inferring phylogenies from incomplete distance matrices. Bioinformatics (2004) 20:2113–2121.[Abstract/Free Full Text]

    Margush T., McMorris F. R. Consensus n-trees. Bull. Math. Biol. (1981) 43:239–244.[Web of Science]

    Mau B., Newton M. A., Larget B. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics (1999) 55:1–1.[CrossRef][Web of Science][Medline]

    Mickevich M. F., Farris J. S. The implications of congruence in Menidia. Syst. Zool. (1981) 30:351–370.[Free Full Text]

    Mickevich M. F., Platnick N. I. On the information content of classifications. Cladistics (1989) 5:33–47.[CrossRef][Web of Science]

    Nakhleh L., Warnow T., Linder C. R. Reconstructing reticulate evolution in species—Theory and practice. Bourne P. E., Gusfield D., eds. (2004) Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology, RECOMB 2004. New York: ACM Press. 337–346.

    Nelson G. J., Platnick N. I. Multiple branching in cladograms: Two interpretations. Syst. Zool. (1980) 29:86–91.[Free Full Text]

    Page R. D. M. Comments on the information content of classifications. Cladistics (1992) 8:87–95.[CrossRef][Web of Science]

    Penny D., Hendy M. D. Testing methods of evolutionary tree construction. Cladistics (1985) 1:266–278.[CrossRef]

    Penny D., Hendy M. D. Estimating the reliability of evolutionary trees. Mol. Biol. Evol. (1986) 3:403–417.[Abstract]

    Posada D., Crandall K. A. Intraspecific gene genealogies: Trees grafting into networks. Trends Ecol. Evol. (2001) 16:37–45.[CrossRef][Medline]

    Purvis A. A modification to Baum and Ragan's method for combining phylogenetic trees. Syst. Biol. (1995) 44:251–255.[Free Full Text]

    Ragan M. A. Phylogenetic inference based on matrix representation of trees. Mol. Phylogenet. Evol. (1992) 1:53–58.[CrossRef][Medline]

    Rannala B., Yang Z. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J. Mol. Evol. (1996) 43:304–311.[Web of Science][Medline]

    Rivera M. C., Lake J. A. The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature (2004) 431:152–155.[CrossRef][Medline]

    Rohlf F. J. Consensus indices for comparing classications. Math. Biosci. (1982) 59:131–144.[CrossRef][Web of Science]

    Sanderson M. J., Purvis A., Henze C. Phylogenetic supertrees: Assembling the trees of life. Trends Ecol. Evol. (1998) 13:105–109.[CrossRef]

    Sokal R. R., Rohlf F. J. Taxonomic congruence in the Lepto-podomorpha re-examined. Syst. Zool. (1981) 30:309–325.[Free Full Text]

    Springer M. S., Amrine H. M., Burk A., Stanhope M. J. Additional support for Afrotheria and Paenungulata, the performance of mitochondrial versus nuclear genes, and the impact of data partitions with heterogeneous base composition. Syst. Biol. (1999) 48:65–75.[Abstract/Free Full Text]

    Steel M. A., Dress W. M., Bocker S. Simple but fundamental limitations on supertree and consensus methods. Syst. Biol. (2000) 49:363–368.[Free Full Text]

    Strimmer K., Moulton V. Likelihood analysis of phylogenetic networks using directed graphical models. Mol. Biol. Evol. (2000) 17:875–881.[Abstract/Free Full Text]

    Strimmer K., Wiuf C., Moutlon V. Recombination analysis using directed graphical models. Mol. Biol. Evol. (2001) 18:97–99.[Free Full Text]

    Swofford D. L. When are phylogeny estimates from molecular and morphological data incongruent? In: Phylogenetic analysis of DNA sequences—Miyamoto M. M., Cracraft J., eds. (1991) New York: Oxford University Press. 295–333.

    Thorley J. L. Cladistic information, leaf stability and supertree construction (2000) Bristol, United Kingdom: University of Bristol. PhD thesis.

    Thorley J. L., Page R. D. M. RadCon: Phylogenetic tree comparison and consensus. Bioinformatics (2000) 16:486–487.[Abstract/Free Full Text]

    Thorley J. L., Wilkinson M., Charleston M. A. The information content of consensus trees. In: Advances in data science and classification—Rizzi A., Vichi M., Bock H.-H., eds. (1998) Berlin, Germany: Springer. 91–98.

    Wilkinson M. Common cladistic information and its consensus representation: Reduced Adams and reduced cladistic consensus trees and profiles. Syst. Biol. (1994) 43:343–368.[Abstract/Free Full Text]

    Wilkinson M. More on reduced consensus methods. Syst. Biol. (1995) 44:435–439.[Free Full Text]

    Wilkinson M. Majority-rule reduced consensus trees and their use in bootstrapping. Mol. Biol. Evol. (1996) 13:437–444.[Abstract]

    Wilkinson M., Cotton J. A., Creevey C., Eulenstein O., Harris S. R., Lapointe F.-J., Levasseur C., McInerney J. O., Pisani D., Thorley J. L. The shape of supertrees to come: Tree shape related properties of fourteen supertree methods. Syst. Biol. (2005) 54:419–31.[Abstract/Free Full Text]

    Wilkinson M., Lapointe F.-J., Gower D. J. Branch lengths and support. Syst. Biol. (2003) 52:127–130.[Free Full Text]

    Winkworth R. C., Bryant D., Lockhart P. J., Havell D., Moulton V. Biogeographic interpretation of splits graphs: Least squares optimization of branch lengths. Syst. Biol. (2005) 54:56–65.[Abstract/Free Full Text]

    Xu S. Phylogenetic analysis under reticulate evolution. Mol. Biol. Evol. (2000) 17:897–907.[Abstract/Free Full Text]

    Zaretskii K. Constructing a tree on the basis of a set of distances between the hanging vertices. Uspekhi Matematicheskikh Nauk (1965) 20:90–92. [in Russian].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
D. San Mauro, D. J. Gower, T. Massingham, M. Wilkinson, R. Zardoya, and J. A. Cotton
Experimental Design in Caecilian Systematics: Phylogenetic Information of Mitochondrial Genomes and Nuclear rag1
Syst Biol, August 18, 2009; (2009) syp043v1.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
T. E. Roberts, E. J. Sargis, and L. E. Olson
Networks, Trees, and Treeshrews: Assessing Support and Identifying Conflict with Multiple Loci and a Problematic Root
Syst Biol, June 16, 2009; (2009) syp025v3.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Extract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gauthier, O.
Right arrow Articles by Lapointe, F.-J.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Gauthier, O.
Right arrow Articles by Lapointe, F.-J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?