© 2008 Society of Systematic Biologists
Parsimony via Consensus
Edited by Bininda-Emonds Olaf, Roderic Page
1 Department of Mathematics, University of California Berkeley, USA; E-mail: tbruen{at}math.berkeley.edu
2 Department of Mathematics, University of Auckland New Zealand; E-mail: d.bryant{at}auckland.ac.nz
| Abstract |
|---|
|
|
|---|
The parsimony score of a character on a tree equals the number of state changes required to fit that character onto the tree. We show that for unordered, reversible characters this score equals the number of tree rearrangements required to fit the tree onto the character. We discuss implications of this connection for the debate over the use of consensus trees or total evidence and show how it provides a link between incongruence of characters and recombination.
Keywords: Compatibility; homoplasy; maximum parsimony; SPR distance
Received April 3, 2006; Revised August 26, 2006; Accepted June 6, 2007
The parsimony length of an unordered, reversible character on a tree equals the minimum number of state changes required to fit the character onto a tree. Here we introduce an alternative way to view parsimony length, one that is dual to the original definition. We show how the parsimony length of a character equals the minimum number of changes in the tree required to fit the tree onto the character. This new perspective on parsimony has applications to consensus trees and recombination detection.
The first application we discuss is how the reformulation of parsimony provides a closer link between parsimony-based analysis and supertree methods. We demonstrate that the maximum parsimony tree can be viewed as a type of median consensus tree, where the median is computed with respect to the SPR distance (see below). As well, the result shows how to conduct a parsimony-based analysis not just on characters but on trees, without having to recode the trees as binary character matrices. This opens the way to a hybrid between the consensus approach and the total evidence approach, where the data are a mixture of characters, trees, and subtrees.
The second application of the main result is to the analysis of pairs of characters. We show that the score of the maximum parsimony tree for two characters is a simple function of the smallest number of recombinations required to explain the incongruence between the characters without homoplasy. This result provides the basis of a highly efficient test for recombination (Bruen et al., 2006).
Here and throughout the paper we assume that all phylogenetic trees are fully resolved (bifurcating) and that by "parsimony" we refer to Fitch or Wagner parsimony, where the character states are unordered and reversible. Some of the results presented here can be extended to other forms of parsimony (Bruen, 2006c), and possibly to incompletely resolved trees, but these lie beyond the scope of this paper.
Note that in this paper we are dealing with unrooted SPR rearrangements, which are those used in tree searches. There is a related, but distinct, concept of rooted SPR rearrangements, where the rearrangements are restricted to obey a type of temporal constraint (Song, 2003). It is this latter class of rooted SPR rearrangments that are used to model lateral gene transfers and recombination. It would be a worthwhile, but challenging, goal to investigate whether any of the results on unrooted SPR rearrangements in this paper can be extended to rooted SPR rearrangements.
| Linking Parsimony with SPR |
|---|
|
|
|---|
A subtree-prune and regraft (SPR) rearrangement is an operation on phylogenetic trees whereby a subtree is removed from one part of the tree and regrafted to another part of the tree, see Figure 1 (Felsenstein, 2004; Swofford et al., 1996). These SPR rearrangements are widely used by tree searching software packages like PAUP (Swofford, 1998) and Garli (Zwickl, 2006). The SPR distance between two trees can be defined as the minimum number of SPR rearrangements required to transform one tree into the other (Hein, 1990; Allen and Steel, 2001; Goloboff, 2007). For example, the two trees T1 and T3 in Figure 1 can be transformed into each other using a minimum of two SPR rearrangements, via the tree T2, so their SPR distance is two.
|
The parsimony length of a character on a tree is the minimum number of steps required to fit that character on the tree, as computed by the algorithm of Fitch (1971). The length of a character Xi on a tree T is denoted
(Xi,T). A character with ri states therefore has parsimony length at least (ri–1), as every state not at the root has to arise at least once. A character is compatible with a tree if it requires at most (ri–1) changes on that tree (Felsenstein, 2004). Instead of fitting a character onto a tree we could just as well fit the tree onto the character. If the character and the tree are compatible, then we have a perfect fit. When there is not a perfect fit we can measure how many SPR rearrangements are required to give a tree that does make a perfect fit. It turns out that this measure gives an equivalent score to parsimony length. More formally:
Theorem 1. Let Xi be a character with ri states and let T be a fully resolved phylogenetic tree. It takes exactly
(Xi,T) – (ri – 1) SPR rearrangements to transform T into a tree compatible with Xi. The result still holds if Xi has some missing states.
As an example, consider the character X1 mapping taxa A,C,D,F to one and B,E,G to zero. The length of this character on tree T1 of Figure 1 is three, and the number of SPR rearrangements needed to transform T1 onto some tree T3 compatible with X1 is two. Note that there could be other trees compatible with X1 that are further than two SPR rearrangements away: the result only gives the number of rearrangements required to obtain the closest tree.
Once stated, the theorem is not too difficult to prove. First show that performing an SPR rearrangement decreases the length by at most one step. Hence it takes at least
(Xi,T) – (ri – 1) SPR rearrangements to transform T into a tree compatible with the character Xi. Then show that this is the minimum required. A formal proof is presented in Appendix 1. A restricted (binary character) version of this theorem was proved in Bryant (2003).
The theorem captures an issue that is central to the interpretation of incongruence: whether an observed incongruence is to be explained by positing homoplasy or by modifying the tree. Define the SPR distance from a tree T to a character Xi to be the SPR distance from T to the closest tree T' that is compatible with Xi. Theorem 1 then tells us that the SPR distance from T to a character Xi is equal to the difference between the length
(Xi,T) of Xi on T and the minimum possible length of Xi on any tree. That is, the amount that the length of a character exceeds the minimum possible length (ri – 1) is equal to the amount we have to modify T in order to remove all incongruence between T and Xi.
| Consensus Trees, Supertrees, and Parsimony |
|---|
|
|
|---|
In their insightful overview of supertree methods, Thorley and Wilkinson (2003) characterize a family of supertree methods that all minimize a sum of the form
|
| (1) |
Let ds(T,Xi) denote the SPR distance from T to the closest fully resolved tree Ti that is compatible with Xi. By Theorem 1, a maximum parsimony tree for X1,...,Xm is one that minimizes the expression
|
| (2) |
An SPR median tree for fully resolved trees t1,...,tn on the same leaf set is a tree T that minimizes
|
|
Now suppose that we have both characters and trees in the input. Both types of phylogenetic data can summarized by an SPR median tree T, chosen to minimize the sum
|
|
It is important to note the difference between this approach and the MRP method (Baum, 1992; Ragan, 1992), which could also be used to combine trees and characters. In MRP, the trees are broken down into multiple independent characters. This is a problem, because the characters encoding a tree are nowhere near independent. In contrast, the SPR median tree approach treats a tree as a single indivisible unit of information.
There is one critical issue that has been side-stepped: computation time. At present, computational limitations make the construction of SPR median trees infeasible for all but the smallest data sets. Computing the SPR distance between two trees is an NP-hard problem (Hickey et al., 2006). In contrast, total evidence and MRP approaches are possible for at least 100 taxa. However, there are now good heuristics for unrooted SPR distance (Goloboff, 2007) and exact special case algorithms (Hickey, et al., 2006) that could be applied to the problem. Below we describe a lower bound method for the SPR distance that should also aid construction of these SPR median trees.
| Parsimony on Pairs of Characters |
|---|
|
|
|---|
Another valuable application of Theorem 1 follows when we consider parsimony analysis of just two unordered and reversible characters. The concept of pairwise character compatibility was introduced by Le Quesne (1969); see also (Felsenstein (2004). Two binary undirected characters with states 0 and 1 are incompatible if and only if all four combinations of 00,01,10, and 11 are present as combination of states for the two characters. In a standard setting, character incompatibility is interpreted as implying that at least one of the characters has undergone convergent or recurrent mutation. In other words, for every possible phylogeny describing the history of the two characters, at least one homoplasy is posited for one of the characters (assuming the characters were correctly scored). Another interpretation of incompatibility of two characters is that characters evolved without homoplasy on two different phylogenies, where the phylogenies differ by one or more SPR rearrangement (Sneath et al., 1975; Hudson and Kaplan, 1985).
Define the total incongruence score i(X1,X2) for two multistate unordered characters X1 and X2 (with r1 and r2 states respectively) as
|
| (3) |
Theorem 2. The total incongruence score i(X1,X2) for two characters equals the minimum SPR distance between a tree T1 and a tree T2 such that T1 is compatible with X1 and T2 is compatible with X2.
Although the notion of total incongruence for two characters has been considered before in the context of character selection and weighting (Penny and Hendy, 1986), it has not been considered in the context of genealogical similarity. Essentially, Theorem 2 shows that the total incongruence score equals the minimum possible number of SPR rearrangements that could have occurred between the phylogenetic histories for both characters, assuming that the characters have different histories with which they are each perfectly compatible.
Indeed, Theorem 2 suggests a natural way to interpret genealogical similarity between two characters, which we have used to develop a powerful test for recombination (Bruen et al., 2006). Choosing two characters from two different genes (which have possibly different histories) gives a simple approach to identify the distinctiveness of the histories of the genes.
We can also apply Theorem 2 to obtain a lower bound on the SPR distance between two trees. Suppose that we have two trees T1 and T2 and we wish to obtain a lower bound on the SPR distance d(T1,T2) between the two trees. If we choose any character X1 compatible with T1 and any character X2 compatible with T2 then, by Theorem 2, we have that i(X1,X2)
d(T1,T2). By carefully choosing X1 and X2 we can obtain tighter bounds.
| Discussion and Extensions |
|---|
|
|
|---|
We have presented a reformulation of parsimony that is complementary or dual to the standard definitions. Instead of measuring how well a character fits onto a tree, we look at how well the tree fits onto the character. A consequence of this new perspective is that we can combine trees and character data using one general SPR framework, and we also obtain new results connecting incongruence measures and recombination. Nevertheless, it is not immediately clear how the new reformulation can be interpreted in itself.
Consider the information a single character, or tree, represents. Given a single character, we can define the set or cloud of trees around the character comprising exactly those trees compatible with the character (Fig. 2). If the character evolved without homoplasy, then the true evolutionary tree must be contained somewhere within the cloud. Now suppose we have multiple characters, each with its own cloud. There may not be a single tree contained in the intersection of all of these clouds. Instead, we search for a tree that is as close as possible to all of the clouds. The distance from T to the cloud associated to character Xi is exactly d(T,Xi), so by Theorem 1 a tree closest to all of the clouds is a maximum parsimony tree.
|
We note that several of the results in this article can be extended; for details, see Bruen (2006c). Both Theorems 1 and 2 remain valid if we replace the SPR distance with the tree bisection and reconnection (TBR) distance. In a TBR rearrangement, a subtree is removed from the tree and then reattached elsewhere with the tree, the difference with SPR being that with SPR we can only reattach the subtree at the node where the branch was broken, whereas with TBR we can reattach from any node in the subtree (Allen and Steel, 2001; Felsenstein, 2004). The TBR distance between two trees is the minimum number of TBR rearrangements required to transform one tree into the other.
That Theorems 1 and 2 hold for the TBR distance may appear surprising because the TBR distance between two trees is always less than, or equal to, the SPR distance between the trees. However, the extension follows by a small change to the proof of Theorem 1, noting that a TBR move can only reduce the parsimony score of a character by at most one.
We have also explored extensions of the result to other distances between trees, notably the Robinson-Foulds distance and the nearest neighbor interchange distance. See See Bruen (2006) for details.
| Appendix |
|---|
|
|
|---|
The first observation is that a TBR rearrangement of a tree increases the length of a character by at most one. As SPR rearrangements are a special case of TBR rearrangements, the same result holds for SPR.
Lemma 1. Let T be a fully resolved phylogenetic tree and Xi an unordered reversible character. Let T' be a phylogenetic tree that differs from T by a single TBR rearrangement. Then
(
,T')
(
,T) +1.
Proof The proof of Lemma 5.1 in Bryant (2004) for binary characters applies directly to the multistate case.
Let dSPR(T,T') denote the unrooted SPR distance between two phylogenetic trees T and T'.
Theorem 1 Let Xi be a character with ri states and let T be a fully resolved phylogenetic tree. It takes exactly
(Xi,T) – (ri – 1) SPR rearrangements to transform T into a tree compatible with Xi. The result still holds if Xi has some missing states.
Proof Let T' be a fully resolved phylogenetic tree compatible with Xi for which dSPR(T,T') is minimized and let m = dSPR(T,T'). Then there exists a sequence of trees T' = T0, ...,Tm = T such that every adjacent pair of trees in the sequence differs by exactly one SPR rearrangement. By Lemma 1 the existence of this sequence implies that
(T,Xi) –
(T',Xi)
dSPR(T,T') and because Xi is compatible with Xi we have
(T',Xi) = ri – 1, giving
|
|
For the other direction, we show that we can construct a sequence of
(T,Xi) – (ri – 1) SPR rearrangements that transform T into a tree T' compatible with Xi. Firstly, if
(T,Xi) – (ri–1) = 0, then T is compatible with Xi, so the proof is finished. Otherwise, let
i be an assignment of states to internal nodes that minimizes the number of state changes (that is, a minimum extension of Xi). Then because Xi is not compatible with T there exist three vertices u,v, and w, where {u,v}
E(T), v lies on the path from u to w and
i(u) =
i(w)
i(v). Perform an SPR rearrangement by removing edge {u,v}, supressing the v vertex and creating a new edge {u,t} where t is a new vertex on an edge adjacent to w. Furthermore, set
i(t) =
i(w). Then the number of edges on which a change has occurred has decreased by 1, thereby decreasing the parsimony length by 1. This procedure can be repeated until the parsimony length equals ri – 1, constructing the desired sequence of trees and completing the proof.
Theorem 2. The total incongruence score i(X1,X2) for two characters equals the minimum SPR distance between a tree T1 and T2 such that X1 is compatible with T1 and X2 is compatible with T2.
Proof. Let T1 and T2 be any two trees compatible with X1 and X2, respectively. Then
(X1,T1) = r1 – 1 and by Theorem 1,
(X2,T1) – (r2–1)
dSPR(T1,T2). We have then that
|
|
We show that this bound can be achieved. Let T be a maximum parsimony tree for the pair of characters X1,X2. By Theorem 1 there exist two trees T1 and T2 such that T1 is compatible with X1, T2 is compatible with X2 and
|
|
dSPR(T1, T) + dSPR(T2,T)
i(X1,X2) and hence |
|
| Acknowledgements |
|---|
We would like to thank Mike Steel, Sebastien Böcker, Olaf Bininda-Emonds, Pablo Golloboff, Melanie Pierson, Mark Wilkinson, and an anonymous referee for their valuable suggestions. This research was partially supported by the New Zealand Marsden Fund.
| References |
|---|
|
|
|---|
-
Allen B., Steel M. Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Comb. (2001) 5:1–13.[CrossRef]
Baum B. Combining trees as a way of combining datasets for phylogenetic inference, and the desirability of combining gene trees. Taxon (1992) 41:3–10.[CrossRef][Web of Science]
Bruen T. Discrete and statistical approaches to genetics (2006) Montreal, Canada: McGill University School of Computer Science. PhD thesis.
Bruen T., Bryant D. A subdivision approach to maximum parsimony. Ann. Comb. (2008) 12:45–51.[CrossRef]
Bruen T., Philippe H., Bryant D. A simple and robust statistical test to detect the presence of recombination. Genetics (2006) 172:1–17.
Bryant D. Building trees, hunting for trees and comparing trees (1997) UK: Department Mathematics, University of Canterbury. PhD thesis.
Bryant D. A classification of consensus methods for phylogenetics. In: Bioconsensus—Janowitz M. F., Lapointe F.-J., McMorris F. R., Mirkin B., Roberts F. S., eds. (2003) Providence, Rhode Island: American Mathematical Society. 163–184. volume 61 of DIMACS.
Bryant D. The splits in the neighborhood of a tree. Ann. Combinatorics (2004) 8:1–11.
Chen D., Eulenstein O., Fernandez-Baca D., Sanderson M. Minimum-flip supertrees: Complexity and algorithms. IEEE/ACM Trans. Comput. Biol. Bioinformatics (2006) 3:165–173.[CrossRef]
Cotton J., Wilkinson M. Majority-rule supertrees. Syst. Biol. (2007) 56:445–452.
Farris J. S., Källersjö M., Kluge A. G., Bult C. Constructing a significance test for incongruence. Syst. Biol. (1995) 44:570–572.
Felsenstein J. Inferring phylogenies (2004) Sunderland, Massachusetts: Sinauer Associates.
Fitch W. M. Towards defining the course of evolution: Minimum change for a specific tree topology. Syst. Zool. (1971) 20:406–416.[Abstract]
Goloboff P. Calculating SPR distances between trees. Cladistics (2007) 23:1–7.[CrossRef][Web of Science]
Gordon A. D. Consensus supertrees: The synthesis of rooted trees containing overlapping sets of labeled leaves. J. Class. (1986) 3:335–348.[CrossRef]
Hein J. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. (1990) 98:185–200.[CrossRef][Web of Science][Medline]
Hickey G., Dehne F., Rau-Chaplin A., Blouin C. The computational complexity of the unrooted subtree prune and regraft distance (2006) Faculty of Computer Science, Dalhousie University. Tech. Rep. CS-2006-06.
Hill T. Development of new methods for inferring and evaluating phylogenetic trees (2007) Uppsala, Sweden: Uppsala Universitet. PhD thesis.
Hudson R. R., Kaplan N. L. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics (1985) 111:147–164.
Lapointe F.-J., Cucumel G. The average consensus procedure: Combination of weighted taxa containing identical or overlapping sets of taxa. Syst. Biol. (1997) 46:306–312.
Le Quesne W. J. A method of selection of characters in numerical taxonomy. Syst. Zool. (1969) 18:201–205.
Penny D., Hendy M. Estimating the reliability of evolutionary trees. Mol. Biol. Evol. (1986) 3:403–417.[Abstract]
Ragan M. A. Phylogenetic inference based on matrix representations of trees. Mol. Phyl. Evol. (1992) 1:53–58.[CrossRef][Medline]
Sneath P., Sackin M., Ambler R. Detecting evolutionary incompatibilities from protein sequences. Systematic Zoology (1975) 24:311–332.
Song Y. S. On the combinatorics of rooted binary phylogenetic trees. Ann. Comb. (2003) 7:365–379.[CrossRef]
Swofford D., Olsen G., Waddell P., Hillis D. Molecular sytematics. In: Phylogenetic inference (1996) Sunderland, Massachusetts: Sinauer Associates. 407–514.
Swofford D. L. PAUP*. Phylogenetic analysis using parsimony (*and other methods) (1998) Sunderland, Massachusetts: Sinauer Associates.
Thorley J. L., Wilkinson M. A view of supertree methods. In: Bioconsensus—Janowitz M. F., Lapointe F.-J., McMorris F. R., Mirkin B., Roberts F. S., eds. (2003) New York: The American Mathematical Society. 185–194. volume 61 of DIMACS, Series in Discrete Mathematics and Theoretical Computer Science.
Zwickl D. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion (2006) University of Texas at Austin. PhD thesis.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



