© 2008 Society of Systematic Biologists
Perfectly Misleading Distances from Ternary Characters
Edited by Paul Lewis
1 Department of Mathematics, University of Hamburg Bundesstr. 55, 20146 Hamburg, Germany; E-mail: bandelt{at}math.uni-hamburg.de
2 Biomathematics Research Centre, University of Canterbury Private Bag 4800, Christchurch, New Zealand; E-mail: mfi28{at}student.canterbury.ac.nz
| Abstract |
|---|
|
|
|---|
D. Huson and M. Steel showed that for any two binary phylogenetic trees on the same set of n taxa, there exists a sequence of multistate characters that is homoplasy-free only on the first tree but perfectly additive only on the second one. The original construction of such a sequence required n – 1 character states and it remained an open question whether a sequence using fewer character states can always be found. In the present note we will answer this question by showing that three character states suffice to construct such misleading sequences—even if we insist that they conform to an ultrametric (i.e., fit a molecular clock).
Keywords: Disconcordant trees; noninformative ternary characters; perfect phylogeny; ultrametric
Received November 7, 2007; Revised January 15, 2008; Accepted April 2, 2008
For the estimation of phylogenetic trees, large DNA data sets are often transformed into distance matrices, to which distance methods are then applied. This transformation, however, inevitably leads to some loss of information but, more seriously, distances can be positively misleading in extreme cases. Huson and Steel (2004) have constructed sequences yielding a unique most parsimonious tree that is totally different from the tree univocally supported by the corresponding distances. Their construction required n–1 character states for n taxa and therefore cannot be realized with DNA sequences whenever n>5. We will show here that no more than three states are actually needed; that is, binary and ternary characters suffice for generating the extreme contrast between character- and distance-based trees.
Recall that a character on a taxon set X is said to be informative (with respect to parsimony) if at least two distinct character states occur more than once in X. Then a noninformative ternary character f has three character states,
, β,
, for which
and β are attained in X exactly once: f(a)=
, f(b)=β, and f(x)=
for all taxa a, b, and x different from a and b. Such a character on X featuring the two taxa a, b thus induces a "trivial" partition of X into two singletons and a remainder, for which we use the shorthand a|b|X–a, b. Similarly, the "trivial" partition (alias trivial split) of X associated with a noninformative binary character distinguishing taxon a is denoted by a|X–a.
Noninformative characters are most useful for our purposes because they can yield some signal for distance-based methods but do not discriminate between trees under the parsimony criterion. Specifically, not only can noninformative ternary characters jointly generate any nontrivial split metric (along with some trivial split metrics), they can also totally cancel any nontrivial split metric too. We can thus freely navigate between perfectly additive distances by using only noninformative ternary characters.
| Results |
|---|
|
|
|---|
We begin with some notation. In the following, we will assume that X is a set of n
4 taxa. Partitions of X into two parts are also called splits; partitions/splits of X are said to be trivial exactly when they are induced by noninformative characters on X. A binary phylogenetic tree on X has n leaves (end nodes) that are labeled by the n taxa in X and n–2 interior nodes that are unlabeled. From any finite sequence f=f1f2...fk of characters on X (as, e.g., obtained from aligned DNA sequences), one derives the mismatch (Hamming) distance function d=df by |
|
The mismatch distances derived from a partition with k parts A1, ..., Ak cannot be distinguished from the sum of the mismatch distances derived from the half-weighted splits Ai |X–Ai (i=1, ..., k):
|
| (1) |
In particular, if f is a noninformative ternary character featuring the two taxa a, b, then the derived mismatch distance df =da|b|X–a,b equals the sum of the three mismatch distances derived from the half-weighted splits a|X–a, b|X–b, and a, b|X–a, b:
|
| (2) |
|
Linear dependence of the split metrics on X leads to several equations; for instance, the most fundamental one equates the sum of all split metrics where one part has cardinality exactly 2 with a multiple of the sum of all trivial split metrics:
|
| (3) |
This formula is straightforward to verify by evaluating either side on a pair x, y of different taxa, yielding 2(n – 2) on either side.
From Equation (3) and the following trivial Equation (4), one immediately obtains the subsequent Equation (5):
|
| (4) |
|
| (5) |
The latter equation is then used to express the split metric dA|B as a linear combination of split metrics where one split part has cardinality at most 2.
Lemma 1 The split metric dA|B for a nontrivial split A|B of an n-taxon set X satisfies the following two equations:
|
| (6) |
|
| (7) |
Proof. To prove (6), one needs to evaluate either side on a pair x, y where x is from A, say, and y is either from A – x or from B. The result on the right side is
in the former case and equal to
in the latter case, as required. From (6) one readily derives (7) by substituting the sum of all
by the right-hand side of (5).
This lemma and the preceding equations constitute the ingredients for the key lemma, which describes how one can either cancel or create a single split metric by means of metrics derived from noninformative ternary characters:
Lemma 2 The split metric dA|B for a nontrivial split A|B of an n-taxon set X can, up to a sum of trivial split metrics, either be cancelled via
|
| (8) |
|
| (9) |
Proof. Using the preceding lemmas, we compute
|
|
|
|
With these prerequisites we can readily prove the following theorem:
Theorem For any two distinct binary phylogenetic trees T1 and T2 on the same taxon set X, there exists a sequence g of binary and ternary characters for which
- g is homoplasy-free for T1 and for no other phylogenetic tree on X,
- the distance function dg derived from g is an ultrametric that is perfectly additive on T2 and on no other phylogenetic tree on X.
Proof. Let f be a sequence of k binary characters that induces the system
*(T1) of the k nontrivial splits of the first tree, which correspond to the k = # X – 3 interior edges of T1. Trivially, f is homoplasy-free on T1 and on no other tree. By adding suitable noninformative ternary characters, we will cancel every split in
*(T1) with respect to the resulting mismatch distances and then create all members of
*(T2), that is, all nontrivial splits of T2, metrically and finally take care of ultrametricity by adding the necessary noninformative binary characters.
Specifically, for every split A|B from
*(T1) and every pair a
A, b
B, we take a ternary character featuring the pair a, b; that is, inducing the partition a|b|X–a, b. Then according to Lemma 2, Equation (8), the resulting expanded sequence f* yields pairwise distances conforming to a bush (alias star) metric. Next, for each split A|B from
*(T2) and all pairs a, a' from A and b, b' from B, we take noninformative ternary characters featuring those pairs a, a' and b, b'. Then, in view of Lemma 2, Equation (9), the mismatch distances with respect to the thus expanded sequence f** of characters are perfectly additive on
*(T2) but no other phylogenetic tree on X.
Finally, in order to obtain an ultrametric, we first select a point r at distance either 0 or 1/2 to the midpoint on a longest path of T2, which has integer distances to the taxa (labeling the leaves ofT2) relative to df**. Let µ be the maximum distance from r to a taxon in X. Then, for every x
X, add µ – df**(r,x) many noninformative binary characters featuring x to the current sequence f**. Then the final sequence g delivers an ultrametric dg and still supports T2 and only T2, as required. This finishes the proof of the theorem.
| Illustration |
|---|
|
|
|---|
In Figure 2, we display a five-taxon instance with three compatible binary and 20 noninformative ternary characters that illustrates the character sequence g constructed in the proof of the theorem. The first two characters induce the compatible splits 1,2|3,4,5 and 1,2,3|4,5. The next five ternary characters cancel the former split, whereas the subsequent five cancel the latter split metrically. Then the next block of four ternary characters establishes the novel distance split 1,3|2,4,5, whereas the subsequent block erects 1,3,4|2,5. The final binary character helps to make the distances ultrametric. These data support the perfect phylogeny shown in Figure 3a, where the parsimony scores are indicated along the links. The ultrametric tree shown in Figure 3b (where the lengths of links are indicated) represents the corresponding mismatch distances.
|
|
| Conclusion |
|---|
|
|
|---|
In regard to the number of character states, the theorem is best possible; that is, compatible binary characters alone would not generate the contrast between parsimony-based and distance-based trees. This is due to the fact that the splits induced by these characters would be faithfully reconstructed from the derived distance function [Proposition 7.1.9].
Although ternary characters come up naturally with DNA data, the theorem is of rather theoretical nature because such an amount of ternary characters necessary for split switching is out of reach with natural data, especially with mitochondrial DNA data sets, where ternary characters are sparse. Nonetheless, the presence of ternary characters with their different contributions to parsimony scores and distance values can yield a minor effect that would come on top of other noise patterns, which could adversely influence the reconstruction of distance trees and their interpretation as phylogenetic trees.
Note that correcting for hidden changes under some model of sequence evolution would not eliminate the problem. Indeed, adding sufficiently many constant characters (unvaried sequence positions) to the data would let the corrected distances come arbitrarily close to the uncorrected mismatch distances. Standard distance methods for estimating a tree from such data would then still approximate the unique binary phylogenetic tree on which the data are nearly additive (e.g., Atteson, 1999; Semple and Steel, 2003, Theorem 7.3.5 Semple and Steel, 2004, Theorem 4.3; Mihaescu et al., 2007).
| References |
|---|
|
|
|---|
-
Atteson K. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica (1999) 25:251–278.[CrossRef][Web of Science]
Huson D., Steel M. Distances that perfectly mislead. Syst. Biol. (2004) 53:327–332.
Mihaescu R., Levy D., Pachter L. Why neighbor-joining works. Algorithmica (2007) DOI 10.1007/s00453-007-9116-4.
Semple C., Steel M. Phylogenetics. In: Oxford Lecture Series in Mathematics and Its Applications (2003) Oxford, UK: Oxford University Press.
Semple C., Steel M. Cyclic permutations and evolutionary trees. Adv. Appl. Math. (2004) 32:669–680.[CrossRef]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||










