Skip Navigation

Systematic Biology 2008 57(5):785-794; doi:10.1080/10635150802424072
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nye, T. M. W.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Nye, T. M. W.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2008 Society of Systematic Biologists

Trees of Trees: An Approach to Comparing Multiple Alternative Phylogenies

Edited by Olivier Gascuel

Tom M. W. Nye

School of Mathematics and Statistics, Newcastle University Newcastle, UK; E-mail: tom.nye{at}ncl.ac.uk


    Abstract
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
Phylogenetic analysis very commonly produces several alternative trees for a given fixed set of taxa. For example, different sets of orthologous genes may be analyzed, or the analysis may sample from a distribution of probable trees. This article describes an approach to comparing and visualizing multiple alternative phylogenies via the idea of a "tree of trees" or "meta-tree." A meta-tree clusters phylogenies with similar topologies together in the same way that a phylogeny clusters species with similar DNA sequences. Leaf nodes on a meta-tree correspond to the original set of phylogenies given by some analysis, whereas interior nodes correspond to certain consensus topologies. The construction of meta-trees is motivated by analogy with construction of a most parsimonious tree for DNA data, but instead of using DNA letters, in a meta-tree the characters are partitions or splits of the set of taxa. An efficient algorithm for meta-tree construction is described that makes use of a known relationship between the majority consensus and parsimony in terms of gain and loss of splits. To illustrate these ideas meta-trees are constructed for two datasets: a set of gene trees for species of yeast and trees from a bootstrap analysis of a set of gene trees in ray-finned fish. A software tool for constructing meta-trees and comparing alternative phylogenies is available online, and the source code can be obtained from the author.

Keywords: Alternative phylogenies; consensus tree; median tree; parsimony; split; topology

Received October 3, 2007; Revised December 21, 2007; Accepted June 4, 2008


Phylogenetic analysis frequently gives rise to several alternative phylogenies for any given fixed set of taxa. For example, using different sets of orthologous genes from a fixed set of species often produces several different tree topologies. In addition, techniques such as bootstrapping and Monte Carlo simulation produce a large number of alternative trees for a given dataset, representing a sample from some underlying distribution of trees. When presented with several alternative phylogenies in this way, identifying the differences and similarities between the phylogenies becomes increasingly difficult as the number of trees and number of species increases, and a systematic approach is required. Although the differences between a handful of trees on fewer than 10 species may be easy to resolve by eye, phylogenetic analyses can potentially produce hundreds of alternative trees on thousands of taxa. There is consequently a pressing need for tools that identify similarities and differences in a collection of trees and represent these meaningfully to biologists.

This article presents an approach to comparing multiple alternative phylogenies via a "tree of trees." Given a set of phylogenies T1, T2, ..., Tn sharing the same set of taxa L, we construct a tree, referred to as a "meta-tree," that groups together phylogenies with similar topologies. A meta-tree is not itself a phylogeny but is a way of representing the relationships between the original set of alternative phylogenies. The set of leaf vertices on a meta-tree corresponds to the original set of alternative phylogenies T1, T2, ..., Tn , and interior vertices on the meta-tree correspond to certain consensus topologies between the phylogenies. In contrast, of course, the leaf vertices on a phylogeny usually correspond to some given set of DNA sequences, whereas the interior vertices correspond to the DNA sequences of ancestral species. The analogy between species trees (where the vertices correspond to DNA sequences) and meta-trees (where the vertices correspond to species tree topologies) motivates our approach to constructing meta-trees. Although for DNA data we might construct a tree that minimizes the number of base changes (i.e., a most parsimonious tree), we analogously seek a meta-tree that minimizes the difference between the species tree topologies associated to its vertices.

The approach adopted in this article is motivated by the need to represent complex datasets, consisting of many alternative phylogenies, in an accessible way to biologists. The space of phylogenetic trees on a fixed set of taxa (Billera and Homes 2001) does not have an intrinsically tree-like structure, and so a collection of alternative phylogenies—or equivalently, a set of points in tree-space—can be represented in many different ways. However, biologists are used to working with tree-like objects, so representing relationships between phylogenies via a meta-tree is a natural approach to adopt, and it enables comparisons to be made in the following way. A meta-tree will cluster similar tree topologies together, so that conflicting evolutionary histories within a set of trees are apparent as separate clades on the meta-tree. For example, the tree for a gene that has undergone a lateral transfer event is likely to conflict considerably with other trees constructed from gene families that have followed the pattern of species evolution more closely. The tree constructed using the laterally transferred gene would probably show up as an outlier on the meta-tree and the source of the conflict would be readily apparent. In this way, meta-trees can reveal the extent of conflicting information in a set of trees as well as identifying which phylogenies cause conflict or support consensus. Meta-trees are also informative about features within trees; for example, providing easy identification of features shared by tree topologies that are clustered together.

It is important to stress that meta-trees do not provide a rigorous direct means of inference about the "true" underlying pattern of evolution but are intended simply as a means of comparing a set of alternative phylogenies and visualizing similarities and differences. The meta-tree for any given set of phylogenies is often not unique. Meta-tree construction does not take into account any relative probabilities or weights of trees in such a set and does not involve modeling any stochastic process that could give rise to different tree topologies. In particular, meta-trees do not provide a rigorous means of suggesting or testing alternative evolutionary histories. However, adopting a model-free approach in this way ensures independence from any single biological or analytical process that gives rise to multiple alternative phylogenies, so that meta-trees can be used to analyze a wide range of different datasets. Exploratory analyses using generic statistical tools such as principal components analysis are used very widely prior to more formal statistical modeling (in particular on large datasets), and the meta-tree method is intended in the same spirit.

A number of different methods for summarizing and representing collections of trees have been explored previously. Consensus tree approaches (Bryant 2003; Felsenstein 2004) attempt to combine the different trees into a single phylogeny, and a number of techniques exist for visualizing alternative trees within a non–tree-like consensus network (Bandelt 1995; Huber and Moulton 2005; Huson and Bryant 2006). Other approaches (Stockham et al. 2002; Hillis et al. 2005) cluster trees together according to topological similarity, in a similar way to meta-trees. Meta-trees offer different insights about a collection of trees, in particular being concerned with comparison of the tree topologies. A more thorough comparison with other methods is given later.

The remainder of the article is structured as follows. Meta-trees are defined and their construction is motivated via analogy with parsimony for DNA data. An heuristic construction algorithm that builds meta-trees according to a local optimality criterion is presented and then applied to two experimental datasets. A software tool for constructing and analyzing meta-trees is available online from http://www.mas.ncl.ac.uk/~ntmwn/phylo_comparison/multiple.html and the source code is available on request from the author.


    Definitions and Notation
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
Throughout the rest of the article, we assume we have been given a set of phylogenies T1, T2, ..., Tn with a shared set of leaf vertices L. The requirement that all the phylogenies should share exactly the same set of leaves is not in fact strictly necessary, but it is retained for ease of presentation. Given the set of alternative phylogenies T1, T2, ..., Tn and leaves L, a meta-tree Formula is an unrooted tree with n leaves such that

  • every vertex Formula in Formula has associated to it a species tree Formula with leaves L, and
  • the leaf vertices of Formula are associated to the trees T1, ..., Tn .
Figure 1 shows a simple example. Throughout the article, superscript hats are used to distinguish "meta-objects" from their counterparts, so, for example, a vertex on a phylogeny will be denoted v, whereas a vertex on a meta-tree will be denoted Formula . Meta-trees will be drawn in figures with hollow circles at the nodes.


Figure 1
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 A meta-tree built from four trees on five taxa A, B, C, D, E. There are two internal vertices on the meta-tree that correspond to phylogenies that are not fully resolved.

 
Three points about the definition above are worth noting. First, species trees are considered as purely topological objects—branch lengths on species trees will be ignored. Secondly, as shown later, the topology Formula associated to a vertex Formula is a consensus of all the topologies associated to the neighbors of Formula . Finally, although an equivalent definition can easily be made for rooted meta-trees, this article deals solely with unrooted meta-trees. In the rooted case, a meta-tree represents a form of hierarchical clustering, with the root node corresponding to some consensus of all the trees T1, T2, ..., Tn . Rooted meta-trees in which each tree Formula is a consensus of the trees associated to descendants of Formula were indeed explored in the preparation of this article. However, these methods of meta-tree construction were found to be less useful for making comparisons between alternative trees and were less satisfactory in terms of methodology than the unrooted approach presented here.

The length of each edge on a meta-tree is equal to its score, calculated by comparing the species tree topologies associated to the vertices at either end of the edge. The more similar these two topologies, the lower the score and the shorter the meta-edge. The total score of the meta-tree is the sum of its edge scores, and we aim to construct a meta-tree with the minimum total score. By minimizing the total score in this way, a meta-tree that best represents similarities and differences between the trees T1, T2, ..., Tn is constructed.

Species tree topologies are scored using the collection of bipartitions or splits of the leaves L that they represent. A split is a partition of L into two disjoint sets. Every species tree T has associated to it a collection of splits of the leaf vertices L, corresponding to cutting T in two at each of its branches. Each species tree T is uniquely determined by its associated collection of splits, and in a slight abuse of notation the same symbol is used to denote the tree and its associated splits. Although trees are represented by sets of splits, it is important to remember that not every collection of splits represents a tree: an additional compatibility condition is required.

The score assigned to a meta-tree edge ê between vertices Formula is defined by counting the number of splits present at one end of the edge but missing from the other:


Formula

This is the Robinson-Foulds distance between the trees Formula and Formula (Robinson and Foulds 1981), also known as the "bipartition distance." The total score for a meta-tree Formula is then:


Formula

where d(T, T') denotes the Robinson-Foulds distance between two trees T, T'. Although other distance metrics such as Tree Bisection and Recombination (Swofford et al. 1996), quartet distance (Estabrook et al. 1985), or nearest-neighbor interchange distance (Waterman and Smith 1978) could be used to score meta-trees, use of the Robinson-Foulds metric is central to our construction algorithm. Further remarks about the choice of metric are made later in the article.

Various consensus methods build trees in terms of collections of splits, although there is a wide variety of consensus methods that use different types of topological unit (Bryant 2003). The majority consensus is based on splits, and it is useful to recall its definition here. The majority consensus of trees T1, T2, ..., Tn is the set of splits contained in strictly more than half the trees. A split p is therefore included in the majority consensus if and only if


Formula

where Ii (p)=1 if pisinTi and Ii (p)=0 otherwise. Ii is called the indicator function for Ti collection of the splits defined in this way constitutes a tree because, by the pigeonhole principle, any two splits in the collection are both contained in at least one tree Ti and are therefore compatible.


    Meta-Trees and Parsimony
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
As mentioned previously, searching for the meta-tree with minimum total score is in many ways analogous to constructing a maximum-parsimony tree T for a set of DNA sequence data, and it is useful to recall that algorithm. At each stage some fixed topology for the phylogeny T is considered. Given this topology, we need to work out the DNA sequence at the internal vertices on T and assign lengths to the edges of T. The length, or score, of each edge is the number of DNA substitutions along that edge. Sankoff's algorithm (Sankoff, 1975; or some variant) is used to determine the DNA sequences at internal vertices on the phylogeny in such a way that the total edge score is minimized. We can envisage constructing a maximum parsimony meta-tree in an entirely analogous way, using a variant of Sankoff's algorithm that works with splits of the leaf set L rather than DNA letters. However, unlike DNA letters, splits cannot be treated independently—the set of splits at each internal node on a meta-tree must be compatible—and it is easy to construct examples in which a naive application of Sankoff's algorithm produces incompatible sets of splits. In fact, the problem of finding the maximum parsimony meta-tree can be framed as a Steiner tree problem (Hwang et al. 1992), most forms of which are NP-complete. Heuristic approaches to meta-tree construction therefore need to be considered.

Maximum-parsimony meta-trees can be characterized at a local level using the following known relationship between the majority consensus of T1, T2, ..., Tn and parsimony with the Robinson-Foulds metric. Consider how to minimize the total score of the meta-tree Formula when it has a star topology with a single internal vertex Formula . The species tree Formula must be chosen so as to minimize


Formula

Such a tree is called a median tree of T1, T2, ..., Tn . When the distance function d is the Robinson-Foulds metric, as assumed here, the majority-consensus tree is a median tree (Barthélemy and McMorris 1986), and so the meta-tree score can be minimized by taking Formula to be the majority consensus of T1, T2, ..., Tn . This result is particularly appealing because not only does it minimize the total meta-tree score but it also ensures the set of splits associated to the internal vertex is tree-like.

This result for the star topology generalizes to cases when the topology of the meta-tree Formula is arbitrary. In any meta-tree for which the states Formula minimize the total score, each tree Formula must be a median tree of its neighbors. In particular, if the meta-tree is fully resolved, then each tree Formula is the majority consensus of its neighbors, as illustrated by Figure 2. When Formula is not fully resolved, for vertices with an even number of neighbors, it is possible that more than one collection of splits Formula will minimize the total score, but the majority consensus is always one of these solutions. This local property for vertices in minimum-score meta-trees forms the keystone of the heuristic construction algorithm described below.


Figure 2
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 A vertex Figure 2 in a fully resolved meta-tree with minimal score. (a) A split p is included in Figure 2 if it is contained in two or more neighbors. (b) If p is in none or only one of the neighbors, it is not included in Figure 2.

 

    An Algorithm for Meta-Tree Construction
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
Given a set of phylogenies T1, T2, ..., Tn on a set of taxa L, consider building a meta-tree according to the following algorithm.
  1. Let Formula be the meta-tree on T1, T2, ..., Tn with the star topology and let Z denote the single internal vertex. Fix the topology TZ associated to Z by taking the majority consensus of T1, T2, ..., Tn .
  2. At the rth step, pick two leaves A,B of Formula to agglomerate, and denote the remaining leaves Z1, ..., Zk .
  3. Form a new meta-tree Formula by joining A and B together at a new vertex X, so that the sub-tree on the vertices {X, Z, Z1, ..., Zk } has the star topology with central vertex Z. Fix the states TX and TZ as the majority consensus of their respective neighbors by taking:


    Formula 1

    (1)


    Formula 2

    (2)
    where maj denotes the majority consensus of sets of splits.

  4. Repeat steps 2 and 3 for every pair of leaves A,B of Formula , and identify the pair that minimizes the score of Formula .
  5. Replace the leaves A and B with vertex X to define Formula and go back to step 2.

I call this algorithm "meta-NJ" on account of its similarity to the neighbor-joining algorithm for constructing phylogenies from evolutionary distances. The algorithm has complexity O(mn3) where n is the number of trees and m is the total number of splits in these trees: at each of O(n) steps, O(n2) agglomerations are considered, and each of these involves deciding whether to include O(m) splits in TX and TZ. The algorithm is not fully defined by the steps above since it is necessary to solve Equations 1 and 2 simultaneously. The solution is derived in the online Appendix, and a worked example illustrating the algorithm is given in the online Supplementary Material (available at http://www.systematicbiology.org).

As seen in the previous section, for a fixed topology on Formula , the minimum-score meta-tree has every tree Formula given by the majority consensus of the neighbors of Formula . This will be referred to as the local optimality criterion. However, as it is defined above, it is not necessarily the case that the algorithm constructs locally optimal meta-trees for the following reason. After several iterations the vertices A, B, Z1, ..., Zk at step 2 will no longer represent topologies from the original tree set {T1, ..., Tn } but might instead consist of vertices formed by earlier agglomerations. At such a stage, although TX and TZ are guaranteed to be the majority consensus of their neighbors by the algorithm, performing the agglomeration of A and B might result in the vertices A, B, Z1, ..., Zk no longer being locally optimal. In order to ensure locally optimality for these nodes, certain exceptional splits need to be removed from the trees TZi following agglomeration. This process is described in the Appendix. With this modification in place, the meta-NJ algorithm is a greedy method of building meta-trees satisfying the local optimality condition.


    Features of the Meta-NJ Algorithm
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
Ties in the Score
At step 4 of the algorithm, it is possible that two different pairs of leaves, say A1, B1 and A2, B2, give rise to meta-trees Formula and Formula that both have the minimum score. When this occurs one of the possibilities is simply chosen at random. Similarly, it can be the case that no refinement Formula has a lower score than Formula at step 4 of the algorithm, in which case one refinement with Formula equal to score Formula is picked at random and used at step 5. Although a more exhaustive search for an optimal meta-tree could be made by keeping track of all the different possibilities, it is more pragmatic to resolve ties in this way since the overall aim is to obtain a meta-tree that represents the data meaningfully in as efficient a way as possible.

Zero-Length Branches
For some pairs of leaves A,B, the solution to Equations 1 and 2 has X=Z so that the length of branch XZ is zero. Such refinements are not permitted, as they make no difference to the meta-tree. Similarly, it is possible for one of the branches AX, BX, or ZZi (i=1, ..., k) to have zero length. If such a refinement is accepted, the resulting meta-tree contains a multi-furcation (an internal node with degree greater than 3). It can also be the case that one of the original leaves T1, ..., Tn of the meta-tree is attached to the meta-tree by a branch of length zero, effectively becoming an interior vertex on the meta-tree. Meta-trees in which the original leaves T1, ..., Tn become internal vertices often look unusual, and so in the event of a tie in the scores of different refinements, preference is given to refinements without zero-length edges.

Trees with Identical Topologies
It has tacitly been assumed that the original trees T1, ..., Tn have distinct topologies, because otherwise any matching pair will be joined by a meta-edge of zero length. In practice, any trees with identical topologies are absorbed into a single meta-tree vertex before tree construction begins. The number of times a particular topology occurs in a collection of trees therefore does not affect how the meta-tree for the collection is constructed. This is unlike some consensus tree approaches in which the frequency of each feature present in a collection of trees determines the final representation. As mentioned previously, the meta-tree approach is not intended as a direct means of inference about the underlying pattern of evolution: although the frequency of a topology in a collection sometimes represents a relative probability for that tree, this information is not used to build the meta-tree.

Other Distance Metrics
It is interesting to consider how the meta-NJ algorithm might be generalized to work with other distance metrics than the Robinson-Foulds metric. Step 3 of the algorithm is stated above in terms of the majority consensus rather than as an optimization problem in terms of distances. However, as we have already seen, the majority consensus minimizes the scores of the edges locally. An alternative way of formulating the problem is to search for tree-like sets of splits TX and TZ that solve


Formula 3

(3)
The Robinson-Foulds metric might be replaced with any other suitable metric in Equation 3, leading to a construction algorithm for meta-trees based on a different metric. However, solving such optimization problems is much more difficult than with the Robinson-Foulds metric in general, and so meta-tree construction becomes more computationally intensive. Consider the quartet distance, for example, as this is quite similar to the Robinson-Foulds metric, being determined by the presence or absence of certain topological sub-units in trees. Various aspects of the methodology break down with the change of metric. First, there is no analogue of the relationship between median trees and the majority consensus for the quartet distance. Secondly, the compatibility conditions for quartet topologies is not as simple as that for splits: every choice of topology for two distinct quartets can be represented on a tree, but sets of three quartets may have incompatible topologies. The machinery based on splits therefore does not extend to quartets, and a different approach is needed. As a general solution, some local search of tree space could be made in order to fix TX and TZ, but the number of trees that would need to be considered is very large. It is precisely because the median tree can be computed so rapidly for the Robinson-Foulds metric that the meta-NJ algorithm can build meta-trees quickly.

One disadvantage of the Robinson-Foulds metric is that a single "mobile" taxon in two trees that are essentially very similar can lead to high values of the metric. If prior biological knowledge were available, such taxa could be removed from a collection of trees before performing meta-tree construction. Alternatively, a mobile taxon could be recognized by viewing a meta-tree. It could then be removed from the dataset, and meta-tree construction re-performed.


    Applications
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
In this section meta-trees are constructed for two datasets: a set of gene trees for species of yeast (Rokas et al. 2003); and trees from a bootstrap analysis of a set of gene trees in ray-finned fish (Li et al. 2007). Rokas et al. (2003) constructed maximum likelihood (ML) gene trees for 106 sets of orthologs in eight species of yeast as part of a study of incongruence between gene trees. The trees exhibit 23 different topologies, with some topologies occurring several times. Figure 3 shows a meta-tree for the Rokas data set. The leaves correspond to the 23 different topologies, and the number of times each topology occurs is marked on the figure. When a topology is associated with a unique gene, the gene name is shown. A description of all the topologies and their associated genes is listed in the online Supplementary Material. However, the same information is directly accessible with the software for viewing meta-trees: a single printed image like Figure 3 cannot convey all the information that can be obtained by viewing a tree interactively.


Figure 3
View larger version (80K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 A meta-tree for the Rokas yeast dataset. There are 23 distinct topologies, many of which occur more than once. The dotted lines mark regions that are defined by the presence or absence of certain splits. Line A: vertices below and right of the line all contain the clade (S. mikatae, S. kudriavzevii, S. bayanus), whereas the clade is absent from nearly all vertices above and left of the line. Similarly, nearly all vertices above and left of A contain the clade (S. cerevisiae, S. paradoxus), whereas every vertex below and right does not. Line B: every vertex below the line contains (S. bayanus, S. kudriavzevii); the clade is absent from every vertex above the line. Line C: every vertex to the left of the line (apart from topology 11) contains the clade (C. albicans, S. castellii, S. kluyveri); the clade is absent from every vertex to the right.

 
The meta-tree in Figure 3 roughly consists of four sub-trees joined to an unresolved central vertex (the vertex has six neighbors). Using the software it is possible to identify splits that are present in a particular sub-tree but absent elsewhere on the meta-tree. This process was used to obtain the different regions marked on the meta-tree, each of which is defined by the presence or absence of a certain split. These regions break the meta-tree up into pieces that correspond to the four sub-trees produced by the meta-NJ algorithm. Outlying trees are immediately apparent from the meta-tree, such as the gene tree for the gene YDR484W, which is connected to the meta-tree by a branch of length 6. Only one of the splits represented by this tree is found elsewhere on the meta-tree, so the tree is highly anomalous.

Li et al. (2007) recently studied a set of 10 orthologous genes in 14 species of fish. Their paper describes a genome-wide method for identifying gene markers for phylogenetic inference. Figure 4 shows a meta-tree for the 10 ML gene trees they obtained. The specific tree topologies are show in Li et al. (2007), supplementary Figure 4. The leaf branches on the meta-tree are long, showing that the gene trees are quite dissimilar, and the internal vertices on the meta-tree correspondingly consist of highly unresolved trees. Overall, the meta-tree has a star-like appearance, with long edges radiating from a few central vertices, and this is typical for sets of trees with a high degree of conflict. At the extreme, if none of the topologies shared any splits, the meta-tree would have the star topology.


Figure 4
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 A meta-tree for maximum likelihood phylogenies for 10 different sets of orthologous genes in ray-finned fish. Each leaf vertex represents a different gene tree and is labeled with the name of the gene. The leaves are attached by relatively long branches to the meta-tree, indicating a high degree of conflict between the alternative topologies. The tree associated to the central unresolved vertex has the same topology as the majority consensus of the 10 gene trees.

 
A bootstrap analysis was performed in order to assess whether the conflict apparent in Figure 4 was caused by statistically significant differences in the phylogenies for each gene, or whether the conflict was instead due to the lack of a strong phylogenetic signal from each gene. The alignments for the analysis were kindly provided by Chenhong Li. The meta-tree in Figure 5 was constructed in order to visualize the results of the analysis, showing 10 bootstrap replicates for each of the 10 genes. It should be emphasized that the meta-tree is not providing the statistical apparatus for analysis—the bootstrap procedure plays that role—but is used to visualize and interpret the results. The meta-tree again has an overall star-like appearance, indicating a high degree of conflict in the bootstrap topologies. Moreover, the trees from each gene do not generally form separate clades on the meta-tree. This suggests that the phylogenetic signal from each gene is relatively weak, and that the conflict apparent in Figure 4 is not due to the individual genes having genuinely different evolutionary histories but is really caused by the low phylogenetic signal contained in each gene. The central branch that divides the meta-tree into two pieces is of interest. Fifty of the 63 trees lying in the piece below the branch contain the clade (Danio rerio, Oncorhynchus mykiss, Ictalurus punctatus, Semotilus atromaculatus), whereas none of the 37 trees in the piece above contain this clade. The split is therefore not present in the majority consensus of the 100 different trees, and it is missing from all the bootstrap relicates for the genes Zic1, Ptr, and myh6. The vertex at the top of the central branch in Figure 5 is associated with the majority consensus topology, which contains only three non-trivial splits.


Figure 5
View larger version (150K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5 A meta-tree for a bootstrap analysis of 10 orthologous genes in ray-finned fish. Ten bootstrap replicates are shown for each gene, and each leaf vertex is labeled with the name of the corresponding gene. The replicates from each gene do not generally form separate clades on the meta-tree, as might be the case if the genes had evolved according to genuinely different trees.

 

    Comparison with Other Approaches
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
Consensus Network Methods
As a means of visualizing and comparing several alternative phylogenies, the meta-tree approach complements existing consensus network methods (Holland and Moulton 2003) but offers certain advantages. Generally, a consensus network for a set of trees T1, ..., Tn broadly resembles some underlying consensus tree but with non tree-like features in regions where the different trees conflict. In contrast, the primary objects represented by a meta-tree are the trees T1, ..., Tn and their relationship to each other, so that it is easy to identify which trees cause conflict or support consensus. However, information about where the trees disagree is available by identifying which meta-tree vertices contain a particular split, or by comparing the trees associated to vertices at either end of any meta-edge. With consensus networks, it is not generally apparent which of the original phylogenies support a given clade or branch, whereas this information is readily available from a meta-tree. In addition, when a set of phylogenies contains a substantial degree of conflict consensus networks are likely to be visually complex and highly connected, making interpretation difficult. A meta-tree, on the other hand, provides a familiar simple environment in which collections of alternative phylogenies can be compared.

The Rokas dataset has previously been analyzed using consensus networks (Holland et al. 2004), and it is interesting to compare the network for the ML gene trees with the meta-tree shown in Figure 3. One of the ML gene tree topologies in the Rokas dataset is repeated 41 times. The majority consensus tree has the same topology in fact, and this topology dominates the consensus network. The network broadly resembles the majority consensus tree, but with non–tree-like features, indicating that the order of branching in the clade (C. albicans, S. castellii, S. kluyveri) is subject to a degree of conflict within the dataset, as well as the positioning of the species S. bayanus and S. kudriavzevii. The representation of the same dataset by the meta-tree is of course very different visually, because the meta-tree displays relationships between trees rather than the topological details of the trees themselves—the tree must be viewed interactively to extract that information. Instead, the meta-tree shows structure within the collection of topologies, grouping together genes with similar tree topologies and bringing attention to unusual topologies. However, the positioning of S. bayanus and S. kudriavzevii, identified on the consensus network as being contentious, is similarly important in the structure of the meta-tree (line B on Figure 3).

Clustering and Multipolar Consensus
The methods of Stockham et al. (2002) and Bonnard et al. (2006) are concerned with representing collections of trees via a number of different consensus trees rather than via a single global consensus tree, due to the loss of information that entails. The method of Stockham et al. (2002) involves clustering trees and so is somewhat similar to the meta-tree approach, although the overall aim of the methodology is quite different.

Stockham et al. (2002) assume the set of trees T1, ..., Tn represents a certain probability distribution, and they aim to identify a smaller set of consensus trees that approximates the original distribution in an efficient way. They use statistical criteria to balance information loss due to the reduced number of trees against the complexity of the approximation. The consensus trees are obtained by clustering the original trees T1, ..., Tn according to some distance metric and taking a consensus of each resulting cluster. Various different distance metrics and clustering algorithms are investigated. The aim of this method is therefore markedly different from ours, even if the methods are similar. Although the paper of Stockham et al. (2002) is concerned with statistical criteria for summarizing a set of trees and the distribution they represent, meta-trees are intended for comparing alternative trees, identifying topological differences between them, and identifying trees that are topological outliers. The method of Stockham et al. (2002) does not provide such capabilities. Similarly, although Stockham et al. (2002) investigate a variety of different tree metrics and distance-based clustering algorithms, the meta-NJ algorithm arises very naturally from the relationship between median trees and the majority consensus, with clustering and consensus performed simultaneously rather than as separate processes.

A similar approach has been proposed by Bonnard et al. (2006), in which a number of consensus trees are used to display all the splits in a collection of trees that lie above a fixed frequency threshold. The method involves pooling all the splits in the collection, thereby losing the identity of different trees from the outset. Like the method of Stockham et al. (2002), it therefore does not represent a means of comparing or visualizing alternative topologies.

Multi-dimensional Scaling
Hillis et al. (2005) represent trees as points in the plane, with the distances between the points determined by MDS so that they approximate the distances between trees measured with the Robinson-Foulds (or other) metrics. Thus, trees with similar topologies tend to be clustered together, and unusual topologies show up as outliers. The authors provide an excellent software package for performing MDS and analysing the results. The user can select clusters of trees and view the corresponding majority consensus tree. In spirit the approach is similar to ours, as it is concerned with identifying and visualizing patterns or structure within collections of trees. However, the MDS approach is not directly informative as to which features are shared by trees in a cluster or how different clusters are related. In particular, different directions in the plane do not correspond to different topological features in any systematic way. By contrast, meta-tree edges represent relationships between trees in terms of loss and gain of splits. The software tool for viewing meta-trees shows the user which splits are present in or absent from given subsets of trees and shows how splits are lost or gained across each meta-edge.

Multidimensional scaling was used to analyze the Rokas yeast dataset and one set of results is shown in Figure 6. The spatial arrangment of the different topologies on the MDS plot shows close similarity to the meta-tree in Figure 3, although the arrangement varies slightly depending on the starting conditions for the MDS analysis. For example, topology 11 is a clear outlier on the plot, with topologies 2 and 14 also lying away from the main cluster of points. These topologies are also outliers on the meta-tree. Although the points on the MDS plot do not form distinct clusters, points lying close to each other appear as clades on the meta-tree. For example, topologies 2, 9, 18, and 20 form a group on the top right-hand side of the MDS plot, whereas topologies 4, 16, 19, and 23 form a group bottom center. Both these groups appears as separate clades on the meta-tree. In addition to information about the overall similarity of topologies, as indicated by relative position, the meta-tree can be used to identify which splits are present at each of the meta-vertices. For example, whereas the MDS plot does not explain what features of topology 11 make it an outlier, the meta-tree shows that only one of the (non-trivial) splits in topology 11 is found in any other topology. Similarly, although the MDS software can be used to obtain the consensus of a group of points, it does not reveal how those points are different from the other topologies. This is in contrast to the meta-tree, where it is straightforward to work out which splits characterize a clade; for example, the group of topologies below line A in Figure 3. An important advantage of the meta-tree, then, is the ability to relate the position of the topologies on the meta-tree to the splits that they contain.


Figure 6
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 6 Multidimensional scaling results for the yeast dataset. The analysis was performed using the (unweighted) Robinson-Foulds distance. The numbers in the figure refer to the same topologies as in Figure 3.

 

    Meta-Networks
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
The space of phylogenetic trees on a fixed set of taxa does not have an intrinsically tree-like structure, and for any collection of alternative tree topologies there may be several meta-trees with minimum score. This will happen more often when the different tree topologies are densely clustered in tree-space. Under such circumstances, two tree topologies that are similar might end up being separated by a relatively long path on a meta-tree. It might be more appropriate to visualize some sets of alternative topologies using meta-networks. A meta-network can be defined as a set of vertices Formula linked by edges Ê such that every vertex Formula is associated to a tree topology Formula on some fixed set of taxa L. Vertices in the meta-network are joined by an edge if the associated tree topologies are similar.

One way of constructing meta-networks is suggested by analogy with the construction of median networks for DNA sequence data (Bandelt et al. 1995; Barthélemy 1989). Given a set of tree topologies T1, ..., Tn on L, the median meta-network is constructed in the following way. For every set of three tree topologies Ti , Tj , Tk (i<j<k), the median tree maj{Ti , Tj , Tk } is computed and added to the original set of trees. This process is applied iteratively, expanding the set of topologies at each step, but it terminates after a finite number of steps. The resulting set is called the median closure of the original set T1, ..., Tn . The median meta-network has one vertex for each tree topology in the median closure, and vertices are joined by an edge when the corresponding tree topologies differ by just one split. Median meta-networks can be constructed for a given set of trees by first representing each tree as a binary sequence indicating presence or absence of each split and then analyzing this binary sequence data with the program SplitsTree4 (Huson and Bryant 2006). The online Supplementary Material contains a median meta-network constructed in this way for the fish gene trees.

Median networks tend to be highly connected and often contain the skeletons of high-dimensional cubes, which can make them difficult to interpret. When the initial set of tree topologies are relatively dissimilar the median closure can potentially be very large and the network resembles a dense net thrown over a region of tree-space. So whereas meta-trees can fail to show certain similarities between trees by providing too sparse a "coverage" of tree-space, the very dense coverage produced by a median meta-network can make extracting these similarities difficult. One future possibility is to reduce the complexity of median meta-networks by pruning vertices and edges in order to achieve an intermediate level of coverage. Various existing methods are available (Huber and Moulton 2005).


    Software
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
Viewing a meta-tree interactively can reveal much more information than a single static image. Software for constructing and viewing meta-trees is available from the author and is also available online as a java applet at www.mas.ncl.ac.uk/~ntmwn/phylo_comparison/multiple.html. The meta-tree viewer consists of two panels side-by-side, one containing the meta-tree and the other displaying species trees. Mouse-clicks on meta-tree vertices cause the corresponding species trees to be displayed, whereas clicks on species tree branches cause the meta-tree vertices containing the split corresponding to that branch to be highlighted. The user can rapidly explore the internal structure of the meta-tree and identify which trees support any given split.


    Conclusion
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
Meta-trees are a tool for comparing several alternative tree topologies and they can be constructed efficiently with the algorithm described in this article. They complement other existing tools for summarizing and visualizing collections of trees but reveal different types of information. The applications presented above illustrate some of the uses of meta-trees for comparing collections of tree topologies, identifying structure within such collections, and visualizing the results of sampling procedures, although there are a number of potential applications beyond those explored here. Meta-networks might offer more valid comparisons for collections of topologies that are dense in tree-space.


    Acknowledgments
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 
The author would like to thank Wally Gilks for his help preparing an earlier draft of the manuscript and for stimulating conversations about the work. Antonis Rokas and Chenhong Li generously provided the data sets analyzed in the Applications section. The anonymous reviewers and Associate Editor gave several very useful comments that lead to improvements in the article and the methodology itself.


    References
 Top
 Abstract
 Definitions and Notation
 Meta-Trees and Parsimony
 An Algorithm for Meta-Tree...
 Features of the Meta-NJ...
 Applications
 Comparison with Other Approaches
 Meta-Networks
 Software
 Conclusion
 Acknowledgments
 References
 

    Bandelt H. Combination of data in phylogenetic analysis. Plant Syst. Evol. (1995) 9:355–361.

    Bandelt H., Forster P., Sykes B., Richards M. Mitochondrial portraits of human population using median networks. Genetics (1995) 141:743–753.[Abstract]

    Barthélemy J. From copair hypergraphs to median graphs with latent vertices. Discrete Math. (1989) 76:9–28.[CrossRef]

    Barthélemy J., McMorris F. The median procedure for n-trees. J. Classif. (1986) 3:329–334.[CrossRef]

    Billera L., Homes S. Geometry of the space of phylogenetic trees. Adv. Appl. Math. (2001) 27:733–767.[CrossRef]

    Bonnard C., Berry V., Lartillot N. Multipolar consensus for phylogenetic trees. Syst. Biol. (2006) 55:837–843.[Abstract/Free Full Text]

    Bryant D. A classification of consensus methods for phylogenetics. In: Bioconsensus—Janowitz M. F., Lapointe F.-J., Morris F. R., Mirkin B., Roberts F. S., eds. (2003) Providence, RI, USA: American Mathematical Society. 163–183.

    Estabrook G., McMorris F., Meacham C. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Zool. (1985) 34:193–200.[Abstract/Free Full Text]

    Felsenstein J. Chapter 30: Consensus trees and distance between trees. In: Inferring phylogenies (2004) Sunderland, Massachusetts: Sinauer. 521–538.

    Hillis D., Heath T., St. John K. Analysis and visualization of tree space. Syst. Biol. (2005) 54:471–482.[Abstract/Free Full Text]

    Holland B., Huber K., Moulton V., Lockhart P. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. (2004) 21:1459–1461.[Abstract/Free Full Text]

    Holland B., Moulton V. Consensus networks: A method for visualising incompatibilities in collections of trees. In: Algorithms in Bioinformatics—Benson G., Page R., eds. (2003) Berlin: Springer Lecture Notes in Computer Science. 165–176. Springer.

    Huber K., Moulton V. Phylogenetic networks. In: Mathematics of phylogeny and evolutio—Gascuel O., ed. (2005) Oxford, UK: Oxford University Press. 178–204.

    Huson D., Bryant D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. (2006) 23:254–267.[Abstract/Free Full Text]

    Hwang F., Richards D., Winter P. The Steiner tree problem (1992) North-Holland, Amsterdam.

    Li C., Ortí G., Zhang G., Lu G. A practical approach to phylogenetics: The case of ray-finned fish as a case study. BMC Evol. Biol. (2007).

    Robinson D., Foulds L. Comparison of phylogenetic trees. Math. Biosci. (1981) 53:131–147.[CrossRef][Web of Science]

    Rokas A., Williams B., King N., Caroll S. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature (2003) 425:798–804.[CrossRef][Medline]

    Sankoff D. Minimal mutation trees of sequences. SIAM J. Appl. Math. (1975) 28:35–42.[CrossRef]

    Stockham C., Wang L., Warnow T. Statistically based postprocessing of phylogenetic analysis by clustering. Bioinformatics (2002) 18:S285–S293.[Abstract]

    Swofford D., Olson G., Wadell P., Hillis D. Phylogenetic inference. In: Molecular systematic—Hillis D., Moritz C., Mable B., eds. (1996) Sunderland, Massachusetts: Sinauer Associates. 411–501.

    Waterman M., Smith T. On the similarity of dendrograms. J. Theor. Biol. (1978) 73:789–800.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nye, T. M. W.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Nye, T. M. W.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?