© 2008 Society of Systematic Biologists
More Taxa Are Not Necessarily Better for the Reconstruction of Ancestral Character States
Edited by Todd Oakley
1 Department of Computer Science, National University of Singapore Singapore, 117590; E-mail: ligl{at}comp.nus.edu.sg
2 Biomathematics Research Centre, University of Canterbury Christchurch, New Zealand; E-mail: M.Steel{at}math.canterbury.ac.nz
3 Department of Mathematics, National University of Singapore Singapore, 117543; E-mail: matzlx{at}nus.edu.sg
Received November 17, 2007; Revised February 11, 2008; Accepted April 7, 2008
Ancestral state reconstruction is an important approach to understanding the origins and evolution of key features of different living organisms (Liberles, 2007). For example, ancestral proteins and genomic sequences have been reconstructed for investigating the origins of genes and proteins (Hillis et al., 1994; Jermann et al., 1995; Zhang and Rosenberg, 2002; Gaucher et al., 2003; Thornton et al., 2003; Blanchette et al., 2004; Cai et al., 2004; Felsenstein, 2004; Taubenberger et al., 2005). A variety of reconstruction methods, including parsimony and maximum likelihood, exist for biomolecular sequences (Yang et al., 1995; Koshi and Goldstein, 1996; Elias and Tuller, 2007), multistate discrete data (Schultz et al., 1996; Mooers and Schluter, 1999; Pagel, 1999), and continuous data (Martins, 1999). These different reconstruction methods have been assessed by both theoretical analyses (Maddison, 1995; Yang et al., 1995) and computer simulation (Schultz et al., 1996; Zhang and Nei, 1997; Salisbury and Kim, 2001; Blanchette et al., 2004; Mooers, 2004; Williams et al., 2006). One important observation in these investigations is that the topology of the phylogenetic tree relating the extant taxa to the target ancestor has a significant influence on reconstruction accuracy. For instance, a star-like phylogeny allows the ancestral character states to be inferred more accurately than other topologies given the same number of terminal taxa under the two-state symmetric model (Schultz et al., 1996; Evans et al., 2000). For more complex models (e.g., on four states such as DNA), the influence of topology on reconstruction accuracy is more complicated (Lucena and Haussler, 2005).
| Models and Methods |
|---|
We study how ancestral state reconstruction depends on taxon sampling, with the assumption that the true phylogenetic tree is given. Intuitively, more terminal taxa should give better reconstruction accuracy. For example, in a recent review, Crisp and Cook (2005:127) recommend that "if ancestral features are to be inferred from a phylogeny, a method that optimizes character states over the whole tree should be used." In certain cases, this viewpoint can be formally justified; for example, consider the problem of estimating the root state given a character at the leaves of a tree, under a model of character evolution in which the branch lengths are known, and each of the states at the root has equal prior probability. In this case, the "most accurate" method for reconstructing the root state is to use a local (or marginal) maximum likelihood (ML) method, applying it to the total set of taxa (not just a subset). Before we justify this claim, recall that for estimating the root state, the local ML procedure simply selects the state that has the highest probability of evolving the given characters under the model (with the branch lengths specified), and any ties are broken uniformly (Koshi and Goldstein, 1996; Schluter et al., 1997; Felsenstein, 2004). By "most accurate," we mean the method that has the highest expected probability of returning the correct root state. The proof of the claimed optimality of ML in this setting can be found in Berger (1985:159) or Steel and Szekely (1999:Theorem 4). For models in which certain states may have higher a priori probability at the root, the most accurate reconstruction method is to maximize the posterior probability of the root state (i.e., the product of the ML score for each root state with its a priori probability).
In summary, if the branch lengths and model are known, it is always best to use all the terminal taxa and to do so in an ML-style framework (which can provide a different estimate from that provided by maximum parsimony).
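As a minimal sketch of the marginal (local) ML estimate described above, the following Python functions compute the likelihood of the observed leaf states for each root state by a pruning recursion under the two-state symmetric model with known conservation probabilities, and then weight it by a prior. The tree encoding (a leaf is its observed state 0 or 1; an internal node is a list of `(child, p)` pairs) and the function names are assumptions made for this illustration, not part of the original study.

```python
def pruning(node):
    """Return (L0, L1): probability of the leaf states below `node`
    given that `node` itself is in state 0 or in state 1."""
    if isinstance(node, int):              # leaf holding its observed state
        return (1.0, 0.0) if node == 0 else (0.0, 1.0)
    l0 = l1 = 1.0
    for child, p in node:                  # p: conservation probability of the branch
        c0, c1 = pruning(child)
        l0 *= p * c0 + (1 - p) * c1        # parent in state 0
        l1 *= p * c1 + (1 - p) * c0        # parent in state 1
    return l0, l1


def root_posterior(tree, prior=(0.5, 0.5)):
    """Posterior of root states 0 and 1: ML score times prior, normalised."""
    l0, l1 = pruning(tree)
    w0, w1 = prior[0] * l0, prior[1] * l1
    return w0 / (w0 + w1), w1 / (w0 + w1)


# Hypothetical example: three leaves attached directly to the root, p = 0.9 each.
star = [(0, 0.9), (0, 0.9), (1, 0.9)]
post0, post1 = root_posterior(star)
print(post0 > post1)  # with a uniform prior the estimate is state 0
```

With a uniform prior the posterior ranking coincides with the ML ranking; a strongly skewed prior, such as `prior=(0.05, 0.95)`, can reverse it for the same data.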
However, if the branch lengths that describe the evolution of the character are not known, the situation is more complicated. For simple models, such as the symmetric Poisson model (e.g., the Jukes-Cantor model on four states), it is known that the ML estimate of the root state (where one also optimizes the branch lengths as "nuisance" parameters) is identical to the maximum parsimony estimate (see Tuffley and Steel, 1997:Theorem 6).
This leads to a natural question regarding the situation where the branch lengths for the character are unknown to the investigator—is the maximum parsimony estimation of the root state using all the leaf species more accurate than just using a subset of leaf species? We will see that for certain trees, it may be better to use some leaves—or even a single leaf—that are "near" the root for estimating the root state.
Given a phylogenetic tree of a group of taxa, we assume that the character evolves by a Markov process, starting with a state at the root and proceeding to the leaves. The evolutionary model specifies the length of each branch or, equivalently, the probability that a state c evolves to a state d on a branch from node f to node x as the conditional probability $\Pr[s_x = d \mid s_f = c]$.
In this study, we also assume that (i) there are only two states, say 0 and 1; and (ii) there is a symmetric rate of change between the two states 0 and 1, or, equivalently, both types of substitution change are equally likely on a given branch. We call the probability $\Pr[s_x = c \mid s_f = c]$ the conservation probability on that branch, denoted p (by the symmetry assumption, p is the same for c = 0 or c = 1). Throughout this article, the accuracy of a reconstruction method is represented as an increasing function of the conservation probability rather than a decreasing function of the probability of change.
We analyze reconstruction accuracy in two evolutionary models: the equal branch length model (Oakley et al., 2005; Lee et al., 2006), which assumes that change happens mostly at speciation events (vertices) and therefore the length of each branch is irrelevant, and the distance model (Oakley et al., 2005; Lee et al., 2006), which assumes that mutations occurred continuously during the course of evolution and hence the branch length is no longer a constant. Both models have their advantages and disadvantages in ancestral state reconstruction (reviewed in Cunningham, 1999).
Under the symmetric evolutionary model given above, the character begins with a state at the root of the phylogenetic tree and changes with probability 1 – p on each branch of the tree; hence, the extant taxa would receive one of many possible distributions of character states 0 and 1. These states in extant taxa are the data used by the Fitch method (Fitch, 1971) to reconstruct the most parsimonious character state at the root as follows. This method assigns a set of states to each node one by one, working from the leaves toward the root and using the subsets previously computed for the node's descendants. For each leaf node, the observed state forms the state subset. For the internal nodes, the following rule is used. Assume A is an internal node with descendants B and C. The state subset $S_A$ is calculated from the state subsets $S_B$ and $S_C$ by:
$$S_A = \begin{cases} S_B \cap S_C, & \text{if } S_B \cap S_C \neq \emptyset,\\ S_B \cup S_C, & \text{otherwise.} \end{cases}$$
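The Fitch rule for a binary tree can be sketched in a few lines of Python. The encoding is an assumption of this example: a leaf is its observed state (0 or 1) and an internal node is a pair `(left, right)`.

```python
def fitch_set(node):
    """Return the Fitch state subset of `node` (bottom-up pass only)."""
    if isinstance(node, int):          # leaf: the observed state forms the subset
        return {node}
    left, right = node
    s_b, s_c = fitch_set(left), fitch_set(right)
    common = s_b & s_c
    # Intersection if it is non-empty, otherwise the union.
    return common if common else s_b | s_c


# Hypothetical five-leaf example.
tree = ((0, 1), (0, (0, 1)))
print(fitch_set(tree))
```

A root subset of `{0}` or `{1}` is an unambiguous reconstruction; `{0, 1}` is an ambiguous one.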
Because we are concentrating on state reconstruction in a symmetric evolutionary model with two states 0 and 1, the reconstruction's accuracy is independent of the prior distribution of the states at the root. Hence, for the Fitch method, the unambiguous (reconstruction) accuracy is
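$$\mathrm{UA} = \Pr[S_A = \{0\} \mid s_A = 0],$$
where A denotes the root, $S_A$ the state subset assigned to it by the Fitch method, and $s_A$ its true state.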
The ambiguous (reconstruction) accuracy is
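$$\mathrm{AA} = \Pr[S_A = \{0, 1\} \mid s_A = 0].$$
The reconstruction accuracy is the unambiguous accuracy plus half of the ambiguous accuracy:
$$\mathrm{RA} = \mathrm{UA} + \tfrac{1}{2}\,\mathrm{AA}.$$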
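These accuracies can be computed exactly for a given tree, without simulation, by propagating the distribution of the Fitch state subset from the leaves to the root. The sketch below assumes the two-state symmetric model; the tree encoding (`None` for a leaf, a pair of `(child, p)` tuples for an internal node) and the function names are illustrative choices, not the authors' implementation.

```python
def fitch_dist(node):
    """(a, b, c) = Pr[S = {0}], Pr[S = {1}], Pr[S = {0,1}] at `node`,
    given that the true state of `node` is 0."""
    if node is None:                       # leaf: observed state equals true state
        return 1.0, 0.0, 0.0
    seen = []
    for child, p in node:                  # p: conservation probability of the branch
        a, b, c = fitch_dist(child)
        # Child subset distribution given the PARENT is in state 0
        # (by symmetry, a child in true state 1 has a and b swapped).
        seen.append((p * a + (1 - p) * b, p * b + (1 - p) * a, c))
    (x1, y1, c1), (x2, y2, c2) = seen
    a = x1 * x2 + x1 * c2 + c1 * x2        # {0}&{0}, {0}&{0,1}, {0,1}&{0}
    b = y1 * y2 + y1 * c2 + c1 * y2
    c = c1 * c2 + x1 * y2 + y1 * x2        # both ambiguous, or {0} versus {1}
    return a, b, c


def accuracies(tree):
    """Return (UA, AA, RA) for the root of `tree`."""
    a, _, c = fitch_dist(tree)
    return a, c, a + c / 2


# Hypothetical example: a leaf on a branch with p = 0.9 and a two-leaf clade
# on a branch with p = 0.8 (inner branches also 0.8).
clade = ((None, 0.8), (None, 0.8))
ua, aa, ra = accuracies(((None, 0.9), (clade, 0.8)))
print(ua, aa, ra)
```

In this example the reconstruction accuracy from all three leaves is about 0.854, which is below the conservation probability 0.9 of the short branch alone.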
| Results |
|---|
In this section, we present several examples to show that the accuracy of the Fitch method for reconstructing ancestral character states at the root from all terminal states in a phylogenetic tree can be smaller than the conservation probability on a path from the root to a nearest leaf. The readers are referred to the online supplementary document (www.math.nus.edu.sg/~matzlx/papers/SysBiolSupplementary.pdf) for details.
We first consider the reconstruction accuracy of the Fitch method on the complete binary phylogenetic tree on $2^n$ taxa in the equal branch length model. As n tends towards infinity, the unambiguous accuracy $UA_n(p)$ converges to 1/3 when $1/2 \le p \le 7/8$ and to
$$\frac{4p-3}{2(2p-1)} + \frac{\sqrt{(4p-3)(8p-7)}}{2(2p-1)^2}$$
when $p > 7/8$, as shown in Steel (1989). When n approaches infinity, the conservation probability on any root-to-leaf path converges to 1/2. Therefore, when $1/2 \le p \le 7/8$, the unambiguous reconstruction accuracy of using all the terminal taxa is smaller than 1/2, the limit of the conservation probability. As n approaches infinity, the reconstruction accuracy $RA_n(p)$ of using all the terminal taxa converges to 1/2 when $p \le 7/8$ and to
$$\frac{1}{2} + \frac{\sqrt{(4p-3)(8p-7)}}{2(2p-1)^2}$$
when $p > 7/8$, as shown in Steel (1989). Hence, on a large, complete binary phylogenetic tree, the conservation probability on a root-to-leaf path is larger than the unambiguous accuracy when p is small but smaller than the reconstruction accuracy of using all the terminal taxa (Fig. 1).
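The limiting behaviour on the complete binary tree can be checked numerically by iterating the Fitch subset distribution one level at a time. In this sketch the recursion and the closed forms used for $p > 7/8$ are derived from the fixed point of that recursion under the equal branch length model; they and the function names are assumptions of the example rather than formulas quoted from Steel (1989).

```python
import math


def complete_binary_accuracy(p, depth=300):
    """Return (UA, RA) at the root of a complete binary tree of the given depth."""
    a, b = 1.0, 0.0                        # leaf: subset {0} given true state 0
    for _ in range(depth):
        c = 1.0 - a - b                    # Pr[S = {0,1}]; keeps the three terms summing to 1
        x = p * a + (1 - p) * b            # child subset {0}, seen from a parent in state 0
        y = p * b + (1 - p) * a            # child subset {1}
        a, b = x * (x + 2 * c), y * (y + 2 * c)
    c = 1.0 - a - b
    return a, a + c / 2


def limiting_accuracy(p):
    """Fixed point of the recursion above: (UA limit, RA limit)."""
    if p <= 7 / 8:
        return 1 / 3, 1 / 2
    root = math.sqrt((4 * p - 3) * (8 * p - 7)) / (2 * (2 * p - 1) ** 2)
    return (4 * p - 3) / (2 * (2 * p - 1)) + root, 0.5 + root


for p in (0.7, 0.95):
    print(p, complete_binary_accuracy(p), limiting_accuracy(p))
```

Computing `c` as `1 - a - b` at every level is a deliberate choice: iterating all three probabilities independently lets rounding error in their sum double at each level.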
Next, we consider the comb-shaped tree with n leaves as shown in Figure 2. Note that in the equal branch length model, a descendant leaf of the root is closer in evolutionary distance to the root than other leaves in a larger clade. The unambiguous accuracy $UA_n(p)$ of using all the terminal taxa in the tree can be written in closed form in terms of the two roots $\lambda_1$ and $\lambda_2$ of a characteristic equation (see the online supplementary document). Because $|\lambda_1|, |\lambda_2| < 1$ for $0 < p < 1$, $UA_n(p)$ converges to
$$\frac{p\,(2 - 3p + 2p^2)}{2 - 3p^2 + 2p^3}$$
as n tends towards infinity. The limit can be easily shown to be less than p. Similarly, the reconstruction accuracy $RA_n(p)$ converges to
$$\frac{1 + 2p - 3p^2 + 2p^3}{2\,(2 - 3p^2 + 2p^3)}.$$
As shown in Figure 2, the limit of the reconstruction accuracy is also smaller than p, the conservation probability on the branch leading to the descendant leaf of the root.
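The two limits for the comb-shaped tree can be checked by growing the comb one leaf at a time and tracking the distribution of the Fitch subset at the current root. The recursion below is a sketch derived from the Fitch rule under the equal branch length model; the function names are illustrative.

```python
def comb_accuracy(p, n_leaves):
    """Return (UA, RA) at the root of a comb-shaped tree with n_leaves leaves."""
    a, b = 1.0, 0.0                        # a single leaf
    for _ in range(n_leaves - 1):
        c = 1.0 - a - b                    # Pr[S = {0,1}] for the current comb
        x = p * a + (1 - p) * b            # comb subset {0}, seen from the new root
        y = p * b + (1 - p) * a            # comb subset {1}
        # The new sister leaf shows 0 with probability p and 1 otherwise.
        a, b = p * (x + c), (1 - p) * (y + c)
    c = 1.0 - a - b
    return a, a + c / 2


def comb_limits(p):
    """Limits of UA_n(p) and RA_n(p) as stated in the text."""
    den = 2 - 3 * p**2 + 2 * p**3
    return p * (2 - 3 * p + 2 * p**2) / den, (1 + 2 * p - 3 * p**2 + 2 * p**3) / (2 * den)


ua, ra = comb_accuracy(0.8, 200)
print(ua, ra, comb_limits(0.8))
```

For p = 0.8 both values settle near 0.638 and 0.772, each below the conservation probability 0.8 of the single branch to the nearest leaf.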
The observation on the comb-shaped trees applies to any asymmetric phylogenetic trees T in which a descendant leaf of the root A is on a branch that is shorter than the branch leading to a large clade, as illustrated in Figure 3. We now establish this result under a model in which the branch length is not constant. Let Y be the descendant leaf and Z the other descendant of A. We assume that the conservation probability on the branches leading to Y and Z are p1 and p2, respectively, and set
$$\alpha = p_1 - p_2, \qquad \beta = p_1 + p_2 - 1,$$
$$U_Z = \Pr[S_Z = \{0\} \mid s_Z = 0], \qquad V_Z = \Pr[S_Z = \{1\} \mid s_Z = 0].$$
The accuracy of the Fitch method for reconstructing the root state is
$$RA = p_1 - \tfrac{1}{2}\,(\alpha\, U_Z + \beta\, V_Z).$$
Since $p_1 \ge p_2 > 1/2$, the reconstruction accuracy RA is less than $p_1$ (because $\alpha \ge 0$, $\beta > 0$ and, in general, $V_Z > 0$). This shows that reconstructing the root state from all the leaf states in T using maximum parsimony is less accurate than using just the state of a leaf adjacent to the root, whenever the branch leading to this leaf is not longer than the branch leading to the clade.
More interestingly, the reconstruction accuracy of the local or marginal ML method is just equal to p1 even with multiple states. For simplicity, we show this only for two-state models as follows. When we say D is a state configuration of the terminal taxa, we mean that D contains a state for each terminal taxon in T. For a root state c and a state configuration D of the terminal taxa, we use P(D| c) to denote the probability that c evolves into the states specified by D at the leaves. For any state configuration DZ of the terminal taxa below the node Z and s = 0, 1, the term sDZ denotes the state configuration of all the terminal taxa in which Y receives state s and other taxa receive the states specified by DZ. For any DZ, we have
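$$P(0D_Z \mid 0) = p_1\,\bigl[\,p_2\,P(D_Z \mid 0) + (1 - p_2)\,P(D_Z \mid 1)\,\bigr],$$
$$P(0D_Z \mid 1) = (1 - p_1)\,\bigl[\,(1 - p_2)\,P(D_Z \mid 0) + p_2\,P(D_Z \mid 1)\,\bigr],$$
where $P(D_Z \mid c)$ denotes the probability that state c at Z evolves into the states specified by $D_Z$. Hence
$$P(0D_Z \mid 0) - P(0D_Z \mid 1) = (p_1 + p_2 - 1)\,P(D_Z \mid 0) + (p_1 - p_2)\,P(D_Z \mid 1).$$
Since $p_1 \ge p_2 > 1/2$, we have P(0DZ | 0) > P(0DZ | 1) (since, in general, P(DZ | 0) > 0 and P(DZ | 1) > 0). This implies that the local ML method correctly infers 0 as the root state with the probability
$$\sum_{D_Z} P(0D_Z \mid 0) = p_1.$$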
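The claim that marginal ML has accuracy exactly equal to the conservation probability of the short branch can be checked by brute force on a small tree: enumerate every leaf configuration, let ML pick the root state, and add up the probability of the configurations on which it is right. The tree, its branch values, and the function names below are hypothetical choices for this sketch.

```python
from itertools import product


def likelihood(node, states):
    """Return (L0, L1) for the leaves below `node`; leaves are indices into `states`."""
    if isinstance(node, int):
        return (1.0, 0.0) if states[node] == 0 else (0.0, 1.0)
    l0 = l1 = 1.0
    for child, p in node:                  # p: conservation probability of the branch
        c0, c1 = likelihood(child, states)
        l0 *= p * c0 + (1 - p) * c1
        l1 *= p * c1 + (1 - p) * c0
    return l0, l1


def ml_accuracy(tree, n_leaves):
    """Probability that marginal ML returns the true root state (taken to be 0)."""
    acc = 0.0
    for states in product((0, 1), repeat=n_leaves):
        l0, l1 = likelihood(tree, states)  # l0 = Pr[this configuration | root = 0]
        if l0 > l1:
            acc += l0
        elif l0 == l1:                     # ties are broken uniformly
            acc += 0.5 * l0
    return acc


# Leaf 0 is Y (p1 = 0.9); leaves 1-3 form the clade below Z (p2 = 0.8).
z = [(1, 0.7), ([(2, 0.6), (3, 0.95)], 0.75)]
tree = [(0, 0.9), (z, 0.8)]
print(ml_accuracy(tree, 4))
```

The branch values inside the clade were picked arbitrarily; only the condition that the branch to Y is no longer than the branch to Z matters for the result.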
Notice that, in the two arguments we have presented above (for parsimony and marginal ML), we have imposed no assumption concerning the conservation probabilities on branches within the tree, other than (i) $p_1 \ge p_2 > 1/2$ and (ii) the other conservation probabilities are non-degenerate (so that the probabilities involved, in particular P(DZ | 0) and P(DZ | 1), are all positive).
We have shown that the accuracy of the Fitch method for reconstructing ancestral character states at the root from all terminal states in a phylogenetic tree can be smaller than the conservation probability on a path from the root to a nearest leaf. To find out how often this happens, we conducted a computer simulation test. We generated random phylogenetic trees using the Yule model. The generation procedure starts with a single root node. In each step, the procedure randomly selects one leaf with uniform distribution from the current tree and adds two descendants to it. The process terminates when the generated phylogeny has the required number of leaves.
For each random phylogenetic tree, we calculated and compared the conservation probability on the shortest root-to-leaf path, the conservation probability on the longest root-to-leaf path, and the accuracy of reconstructing the ancestral state at the root from all the leaf states. We assumed that all branches had the same length and that the conservation probability is p. For N = 9, 15, 20 and p = 0.5 + 0.01i, 0 ≤ i ≤ 49, we generated 5000 random phylogenetic trees with N leaves and the conservation probability p on each branch. The left panel of Figure 4 gives the proportion of generated phylogenetic trees in which the conservation probability on the shortest root-to-leaf path is larger than the accuracy of reconstructing the ancestral root state from all the leaf states with the Fitch method. When p is in the range of 0.5 and 0.8, the conservation probability on the shortest root-to-leaf path is larger than the accuracy of reconstructing the correct state at the root in a large portion of trees. When p exceeds 0.83, the number of "bad" trees decreases rapidly. The right panel of Figure 4 shows the reconstruction accuracy of these three different reconstructions from some sampled trees. It is well known that the Yule model tends to produce trees that are, on average, more balanced than most real reconstructed trees (Aldous, 2001; Blum and François, 2006) and so we expect the level of support for the accuracy of root state reconstruction using a single species to be higher on real trees.
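The simulation procedure can be sketched as follows. This is not the authors' code: the tree encoding, the function names, the seed, and the small tolerance used in the comparison are assumptions of the example, and the Fitch accuracy is computed exactly per tree rather than by sampling characters. The conservation probability along a path of k equal branches is taken as $(1 + (2p - 1)^k)/2$, the standard composition rule for the two-state symmetric model.

```python
import random


def yule_tree(n_leaves, rng):
    """Split a uniformly chosen leaf until the tree has n_leaves leaves.
    A leaf is an empty list; an internal node is a list of two children."""
    root = []
    leaves = [root]
    while len(leaves) < n_leaves:
        node = leaves.pop(rng.randrange(len(leaves)))
        left, right = [], []
        node.extend((left, right))
        leaves.extend((left, right))
    return root


def fitch_dist(node, p):
    """Pr[S = {0}], Pr[S = {1}], Pr[S = {0,1}] at `node`, given its true state is 0."""
    if not node:
        return 1.0, 0.0, 0.0
    (a1, b1, c1), (a2, b2, c2) = fitch_dist(node[0], p), fitch_dist(node[1], p)
    x1, y1 = p * a1 + (1 - p) * b1, p * b1 + (1 - p) * a1
    x2, y2 = p * a2 + (1 - p) * b2, p * b2 + (1 - p) * a2
    return (x1 * x2 + x1 * c2 + c1 * x2,
            y1 * y2 + y1 * c2 + c1 * y2,
            c1 * c2 + x1 * y2 + y1 * x2)


def fitch_ra(tree, p):
    a, _, c = fitch_dist(tree, p)
    return a + c / 2


def count_leaves(node):
    return 1 if not node else sum(count_leaves(ch) for ch in node)


def shortest_depth(node):
    return 0 if not node else 1 + min(shortest_depth(ch) for ch in node)


def path_conservation(p, k):
    return (1 + (2 * p - 1) ** k) / 2


def fraction_bad(n_leaves, p, n_trees, seed=0):
    """Share of trees where the nearest leaf alone beats Fitch on all leaves."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(n_trees):
        tree = yule_tree(n_leaves, rng)
        if path_conservation(p, shortest_depth(tree)) > fitch_ra(tree, p) + 1e-12:
            bad += 1
    return bad / n_trees


print(fraction_bad(9, 0.7, 200))
```

Calling `fraction_bad` over a grid of p values for N = 9, 15, 20 would give curves comparable in spirit to the left panel of Figure 4, although the exact numbers depend on the random seed and the number of trees.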
In general, the reconstruction accuracy of using a subset of the terminal taxa can also be higher than that obtained by using all the terminal taxa. For example, for the phylogenetic tree given in Figure 5 in which the conservation probability in each branch is 0.71, the reconstruction accuracy is 0.5878 if all the leaf states are used and 0.5916 if the states of only the four closest leaves indicated in the figure are used. This is true as long as the conservation probability is in the range from 0.5 to 0.82.
| Discussion |
|---|
In studying how the accuracy of ancestral state reconstruction depends on taxon sampling, we demonstrated that more taxa are not necessarily better for ancestral state reconstruction with the Fitch method under the assumption that the true phylogenetic tree was given. This also happens with the maximum likelihood method. Our results and analyses have several implications.
First, taxon sampling has a subtle effect on the accuracy of ancestral state reconstruction. Unambiguous and ambiguous accuracy are considered separately by Salisbury and Kim (2001). Our analyses indicate that unambiguous and ambiguous accuracy first decrease and then increase with the number of taxa sampled in reconstructing the root state in a phylogenetic tree. This pattern of increased accuracy with a large data set of sampled taxa is consistent with the simulation results in Salisbury and Kim (2001). In our article, we define the reconstruction accuracy to be the unambiguous accuracy plus half of the ambiguous accuracy for the Fitch method. The reconstruction accuracy is much more sensitive to the tree structure and does not monotonically depend on the size of taxon sampling, especially when the given phylogenetic tree is asymmetric. As a result, researchers may need to decide how to select taxa from the observable extant species in reconstructing the root state of a clade. In certain cases, a single extant taxon at the end of a slowly evolving lineage (basal to the root) may provide a more accurate estimate of the root state than a tree-based analysis involving all the taxa (in contrast to Crisp and Cook, 2005).
Secondly, reconstruction methods attempt to incorporate the tree structure into ancestral state reconstruction. Suppose, for example, that 88 lineages formed a very recently diversified clade with a very long stem and a shorter single sister lineage. If the 88 lineages have state 1, but the sister lineage has state 0, which state should their common ancestor have? This is a typical situation when both a fossil record and extant data are used for ancestor state reconstruction (e.g., evolution of body size in the Caniformia in Finarelli and Flynn, 2006). Our analysis concludes that the local ML method selects 0 as the root state. Hence, when a fossil record is used, it is very likely for the reconstructed ancestor to take the fossil state if the local ML method is used.
Moreover, even for maximum parsimony, the estimated root state is not necessarily the same as the most frequently observed state at the leaves. This is obvious if the tree is highly unbalanced, but it is also true for a perfectly balanced tree with at least 32 leaves (i.e., a complete balanced tree with $2^h$ leaves, all the same distance from the root, and with $h \ge 5$). For these trees, there exist binary characters for which maximum parsimony will assign the root state 0 even when the proportion of leaves in state 0 is less than 1/2 (indeed, this proportion can be made as close to 0 as desired by taking h sufficiently large); for details, see Theorem 2 of Charleston and Steel (1995). Thus, the structure of the tree and the way the character states are distributed at the leaves are both important in assigning an optimal root state, in contrast to a simpler "majority rule" approach to root state estimation.
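The threshold of 32 leaves can be reproduced with a small dynamic program over the height of a perfectly balanced tree. The recursion is a sketch worked out from the Fitch rule for this example, not taken from Charleston and Steel (1995): `f` is the smallest number of leaves in state 0 that still yields the root subset {0}, and `g` the smallest number that yields {0, 1}.

```python
def min_zero_leaves(h):
    """Fewest leaves in state 0 for which Fitch assigns {0} to the root
    of a perfectly balanced binary tree of height h (2**h leaves)."""
    f, g = 1, float("inf")                 # height 0: a single leaf, never ambiguous
    for _ in range(h):
        # {0} arises from ({0},{0}) or ({0},{0,1});
        # {0,1} arises from ({0,1},{0,1}) or ({0},{1}), where the {1} side needs no zeros.
        f, g = min(2 * f, f + g), min(2 * g, f)
    return f


for h in range(1, 8):
    print(h, min_zero_leaves(h), 2 ** h)
```

The minimum follows a Fibonacci-like sequence (2, 3, 5, 8, 13, ...), so it first drops strictly below half of the leaves at h = 5 (13 of 32), and the proportion keeps shrinking as h grows.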
Finally, we have derived counterintuitive phenomena under a particularly simple evolutionary model, namely, the two-state symmetric model where branch lengths can be equal or unequal. A natural question that empiricists may ask is how often this counterintuitive situation arises in practice. To take one example, for certain data, branch lengths might be expected to satisfy a molecular clock—this amounts to allowing each edge e to have its own conservation probability p(e) but requiring that the sum of –log [2p(e) – 1] is to be constant on each root-to-leaf path. A phylogenetic tree with this clock constraint is said to be ultrametric. For an ultrametric tree, it might be expected that using all the taxa results in more accurate root state estimation than using a subset of the taxa only. The simulations of Salisbury and Kim (2001) and Zhang and Nei (1997) suggested that this is often the case. Salisbury and Kim (2001) investigated how the accuracy of reconstructing root states responds to size changes in taxon sampling in an ultrametric phylogeny generated in a Yule model. Their results indicate that reconstruction accuracy is generally increased by using more taxa.
However, once again, we find that this general trend is not universally valid. More precisely, our simulation test shows that even with an ultrametric phylogenetic tree, the Fitch method or the joint ML method using a particular subset of terminal taxa can be more accurate (or at least as accurate) for ancestral state reconstruction than using all terminal taxa. We also observed that this holds for the four-state symmetric model. A related phenomenon was shown by Mossel (2001) for a certain asymmetric model using information-theoretic methods. In summary, the phenomena we have described are not restricted to trees that have a highly unusual set of (non–clock-like) branch lengths.
Despite this further counterintuitive result, we end by offering the following positive conjecture: for any ultrametric phylogenetic tree and a symmetric model, the Fitch parsimony method using all terminal taxa is more accurate (or at least as accurate) for ancestral state reconstruction than using any particular terminal taxon. Note that all root-to-leaf paths have the same conservation probability under a clock model.
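The final remark can be checked with a short calculation. Under the two-state symmetric model, the quantity $2p - 1$ composes multiplicatively along a path, which is a standard property of this model; the symbol $P(\pi)$ for the conservation probability of a root-to-leaf path $\pi$ is introduced here only for illustration.

```latex
2P(\pi) - 1 \;=\; \prod_{e \in \pi} \bigl(2p(e) - 1\bigr)
\quad\Longrightarrow\quad
-\log\bigl[2P(\pi) - 1\bigr] \;=\; \sum_{e \in \pi} -\log\bigl[2p(e) - 1\bigr].
```

If the sum on the right takes the same value on every root-to-leaf path, as the clock constraint requires, then $P(\pi)$ is the same for every leaf.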
| Acknowledgment |
|---|
The authors thank the anonymous reviewers and Todd Oakley for pointing out related references and helpful comments on the relevance of the results to applying the fossil records to ancestral state reconstruction. L. X. Zhang gratefully acknowledges an NUS ARF grant and NSFChina3052802 for partially supporting this project. G. L. Li was partially supported by the NUS President's Graduate Fellowship. L. X. Zhang also thanks Webb Miller for stimulating this research by pointing out the paper by Lucena and Haussler to him.
| References |
|---|
Aldous D. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat. Sci. (2001) 16:23–34.
Berger J. O. Statistical decision theory and Bayesian analysis (1985) 2nd edition. Berlin: Springer-Verlag. Springer Series in Statistics.
Blanchette M., Green E. D., Miller W., Haussler D. Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res. (2004) 14:2412–2423.
Blum M. G. B., François O. Which random processes describe the Tree of Life? A large-scale study of phylogenetic tree imbalance. Syst. Biol. (2006) 55:685–691.
Cai W., Pei J., Grishin N. V. Reconstruction of ancestral protein sequences and its applications. BMC Evol. Biol. (2004) 4:e33.
Charleston M., Steel M. A. Five surprising properties of parsimoniously colored trees. Bull. Math. Biol. (1995) 57:367–375.
Crisp M. D., Cook L. G. Do early branching lineages signify ancestral traits? Trends in Ecol. Evol. (2005) 20:122–128.
Cunningham C. W. Some limitations of ancestral character-state reconstruction when testing evolutionary hypotheses. Syst. Biol. (1999) 48:665–674.
Elias I., Tuller T. Reconstruction of ancestral genomic sequences using likelihood. J. Comput. Biol. (2007) 14:216–237.
Evans W., Kenyon C., Peres Y., Schulman L. J. Broadcasting on trees and the Ising model. Ann. Appl. Prob. (2000) 10:410–433.
Felsenstein J. Inferring phylogenies (2004) Sunderland, Massachusetts: Sinauer Associates.
Finarelli J. A., Flynn J. J. Ancestral state reconstruction of body size in the Caniformia (Carnivora, Mammalia): The effects of incorporating data from the fossil record. Syst. Biol. (2006) 55:301–313.
Fitch W. M. Toward defining the course of evolution: Minimum change for a specific tree topology. Syst. Zool. (1971) 20:406–416.
Gaucher E. A., Thomson J. M., Burgan M. F., Benner S. A. Inferring the palaeoenvironment of ancient bacteria on the basis of resurrected proteins. Nature (2003) 425:285–288.
Hillis D. M., Huelsenbeck J. P., Cunningham C. W. Application and accuracy of molecular phylogenies. Science (1994) 264:671–677.
Jermann T. M., Opitz J. G., Stackhouse J., Benner S. A. Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature (1995) 374:57–59.
Koshi J. M., Goldstein R. A. Probabilistic reconstruction of ancestral protein sequences. J. Mol. Evol. (1996) 42:313–321.
Lee C., Blay S., Mooers A. Ø., Singh A., Oakley T. H. CoMET: A Mesquite package for comparing models of continuous character evolution on phylogenies. Evol. Bioinformatics Online (2006) 2:193–196.
Liberles D. A., ed. Ancestral sequence reconstruction (2007) New York: Oxford University Press.
Lucena B., Haussler D. Counterexample to a claim about the reconstruction of ancestral character states. Syst. Biol. (2005) 54:693–695.
Maddison W. P. Calculating the probability distributions of ancestral states reconstructed by parsimony on phylogenetic trees. Syst. Biol. (1995) 44:474–481.
Martins E. P. Estimation of ancestral states of continuous characters: A computer simulation study. Syst. Biol. (1999) 48:642–650.
Mooers A. Ø. Effects of tree shape on the accuracy of maximum likelihood-based ancestor reconstruction. Syst. Biol. (2004) 53:809–814.
Mooers A. Ø., Schluter D. Reconstructing ancestor states with maximum likelihood: Support for one-and two-rate models. Syst. Biol. (1999) 48:623–633.
Mossel E. Reconstruction on trees: Beating the second eigenvalue. Ann. Appl. Prob. (2001) 11:285–300.
Oakley T. H., Gu Z., Abouheif E., Patel N. H., Li W.-H. Comparative methods for the analysis of gene-expression evolution: An example using yeast functional genomic data. Mol. Biol. Evol. (2005) 22:40–50.
Pagel M. The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies. Syst. Biol. (1999) 48:612–622.
Salisbury B. A., Kim J. Ancestral state estimation and taxon sampling density. Syst. Biol. (2001) 50:557–564.
Schluter D., Price T., Mooers A. Ø., Ludwig D. Likelihood of ancestral states in adaptive radiation. Evolution (1997) 51:1699–1711.
Schultz T. R., Cocroft R. B., Churchill G. A. The reconstruction of ancestral character states. Evolution (1996) 50:504–511.
Steel M. Distribution in bicolored evolutionary trees (1989) New Zealand: Massey University. Ph.D. thesis.
Steel M. A., Székely L. A. Inverting random functions. Ann. Combin. (1999) 3:103–113.
Taubenberger J. K., Reid A. H., Lourens R. M., Wang R., Jin G., Fanning T. G. Characterization of the 1918 influenza virus polymerase genes. Nature (2005) 437:889–893.
Thornton J. W., Need E., Crews D. Resurrecting the ancestral steroid receptor: Ancient origin of estrogen signaling. Science (2003) 301:1714–1717.
Tuffley C., Steel M. Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bull. Math. Biol. (1997) 59:581–607.
Williams P. D., Pollock D. D., Blackburne B. P., Goldstein R. A. Assessing the accuracy of ancestral protein reconstruction methods. PLoS Comput. Biol. (2006) 2:598–605.
Yang Z., Kumar S., Nei M. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics (1995) 141:1641–1650.
Zhang J., Nei M. Accuracies of ancestral amino acid sequences inferred by parsimony, likelihood, and distance methods. J. Mol. Evol. (1997) 44:139–146.
Zhang J., Rosenberg H. F. Complementary advantageous substitutions in the evolution of an antiviral RNase of higher primates. Proc. Natl. Acad. Sci. USA (2002) 99:5486–5491.