© 2005 Society of Systematic Biologists
The Systematic Component of Phylogenetic Error as a Function of Taxonomic Sampling Under Parsimony
Edited by Tim Collins: Associate Editor
Department of Biological Sciences, University of Cincinnati Box 210006, Cincinnati, Ohio, 45221–0006, USA E-mail: ron.debry{at}uc.edu
| Abstract |
|---|
The effect of taxonomic sampling on phylogenetic accuracy under parsimony is examined by simulating nucleotide sequence evolution. Random error is minimized by using very large numbers of simulated characters. This allows estimation of the consistency behavior of parsimony, even for trees with up to 100 taxa. Data were simulated on 8 distinct 100-taxon model trees and analyzed as stratified subsets containing either 25 or 50 taxa, in addition to the full 100-taxon data set. Overall accuracy decreased in a majority of cases when taxa were added. However, the magnitude of change in the cases in which accuracy increased was larger than the magnitude of change in the cases in which accuracy decreased, so, on average, overall accuracy increased as more taxa were included. A stratified sampling scheme was used to assess accuracy for an initial subsample of 25 taxa. The 25-taxon analyses were compared to 50- and 100-taxon analyses that were pruned to include only the original 25 taxa. On average, accuracy for the 25 taxa was improved by taxon addition, but there was considerable variation in the degree of improvement among the model trees and across different rates of substitution.
Keywords: Parsimony; phylogenetic accuracy; phylogenetic error; taxon sampling
Received April 22, 2004; Revised August 5, 2004; Accepted October 7, 2004
Finding optimal phylogenetic trees using parsimony is extremely difficult when there are many taxa. As the number of taxa grows, the number of possible bifurcating trees grows such that it rapidly becomes impossible to evaluate all trees for optimality with current computing limits (Graham and Foulds, 1982). Remarkably, a simulation by Hillis (1996) led him to suggest that it may be easy to recover very large trees with high levels of accuracy. Using a 228-taxon tree originally obtained from real data (Soltis et al., 1997), parsimony analysis of a single simulated data set of 5000 nucleotide characters recovered the exactly correct topology. Even a simple, approximate parsimony algorithm (parsimony-driven stepwise addition, with no branch swapping) obtained a tree that was > 99% identical to the model tree. Hillis (1996) offered two hypotheses based on this result. First, a general property of the parsimony method is that the true tree will become increasingly easy to discover as the density of taxonomic sampling increases. This effect would be due to shortening of the average branch length across the tree, which will cause homoplasy to be dispersed across the tree, rather than being concentrated in pairs of long branches. The dispersal of homoplasy allows detection of the true phylogenetic signal, even in the presence of considerable noise. Second, analysis of very large numbers of taxa would reduce the need for computationally complex methods of analysis. In other words, if parsimony tends to become a highly accurate estimator of tree topology as the number of taxa grows large, then there is no need to pursue more computationally demanding model-based methods, such as maximum likelihood. The first hypothesis has recently been simplified to read that increasing the density of taxonomic sampling will increase the accuracy of phylogenetic inference (e.g., Pollock et al., 2002; Zwickl and Hillis, 2002). 
This hypothesis has been the subject of intense debate over the past few years, with both supporters (e.g., Purvis and Quicke, 1997; Hillis, 1997, 1998; Graybeal, 1998; Pollock et al., 2002; Zwickl and Hillis, 2002) and detractors (e.g., Kim, 1996, 1998; Rosenberg and Kumar, 2001). The second hypothesis clearly follows, if the first is correct. Given the increasing emphasis on model-based methods in phylogenetics, the behavior of parsimony in large trees warrants further attention.
Differences in methodology are responsible for at least part of the difficulty in reaching a consensus opinion on this topic. For example, different authors have measured "accuracy" in different ways: either as the fraction of the branching pattern of the true tree that is inferred correctly (p), or, out of many true trees, the fraction of those trees that are inferred exactly correctly (f). Thus, Kim (1996), using f, showed that accuracy decreases as the number of taxa increases, whereas simulations by Zwickl and Hillis (2002), using p, showed that accuracy increased as the number of taxa was increased. Another difference across studies is that, in several cases, the sampling space of taxa has differed among trees of different sizes. For example, when Kim (1996) studied the effects of adding taxa, the additional taxa simultaneously increased the taxonomic scope. The total substitution depth of the tree increased along with the number of taxa, so the average branch length did not decrease. Hillis (1998) correctly argued that such a study design is not relevant to the usual definition of increased taxon sampling, which involves addition of taxa within a group, and not addition of distantly related taxa (but see Pollock and Bruno [2000] for evidence that adding taxa outside the group of interest can improve accuracy of maximum likelihood analysis).
Simple logical arguments predict that f will decrease as taxa are added to a phylogenetic problem (Kim, 1996), and Kim's (1996, 1998) analytical method has shown that to be correct. However, Kim's method is only applicable to f, and not to p. Indeed, Yang and Goldman (1997) pointed out that there is no theory describing the expected relationship between the number of taxa and p. Because the current controversy is focused on p, from here forward the term accuracy will refer to p. It appears that simulations are needed to study the relationship between p and the number of taxa. The simulations published to date appear to support the hypothesis that, under parsimony, p will increase with the number of taxa. Although Hillis (1996) did not vary the number of taxa, the result that an entire 228-taxon tree could be inferred correctly is consistent with the hypothesis that accuracy increases with the number of taxa. However, the specific model tree used in those 228-taxon simulations may have been one that was particularly easy for parsimony analysis (Kumar and Gadagkar, 2000). A recent series of papers (Rosenberg and Kumar, 2001, 2003; Pollock et al., 2002; Zwickl and Hillis, 2002; Hillis et al., 2003) appears to have resulted in agreement that increasing taxon sampling generally increases accuracy. The debate between those two groups has focused more on whether the increase in accuracy is strictly due to the addition of taxa (Pollock et al., 2002; Zwickl and Hillis, 2002; Hillis et al., 2003), or if the main effect of adding taxa is to bring additional characters into the analysis (Rosenberg and Kumar, 2003).
The entire series of studies is based on a single 66-taxon model tree inferred from sequence data from 14 mammalian orders (Murphy et al., 2001), and so provides little evidence regarding the generality of these results across trees with different shapes. Further, those previous simulation studies have either contrasted the effects of increasing the number of characters against the effects of increasing the number of taxa (e.g., Rosenberg and Kumar, 2001, 2003) or held the number of characters constant at a relatively low value (e.g., Zwickl and Hillis, 2002). In either case, p was always measured as the deviation from the true, or model, tree. Such a design confounds the two different kinds of phylogenetic error (Swofford et al., 1996): systematic error and random error. Systematic error occurs when the most parsimonious tree (MPT) for a data set with an infinite number of characters is not identical to the model tree. I will hereafter refer to this hypothetical MPT found using an infinitely large data set as MPT∞. Random error can be divided into two distinct types: tree-search error and sampling error. Tree-search error occurs when the search algorithm fails to find the correct MPT for a specific data set. It has been suggested that increased character sampling will reduce the chance of a tree-search error (Hillis et al., 1994), but there have been no studies of the relationship between number of characters and success of heuristic search algorithms in large trees. The level of tree-search error cannot be determined with certainty for a phylogeny with many taxa, but may be estimated by performing multiple heuristic searches from randomly chosen starting points and evaluating the frequency with which the same estimated MPT is found across searches. Sampling error occurs when the true MPT for a finite data set is not identical to MPT∞ (which may or may not be identical to the true tree, depending on whether or not there is any systematic error). Increasing the number of characters will reduce sampling error (so long as the additional characters are sampled from a homogeneous source). It is not clear if the level of sampling error at a fixed number of characters will increase or decrease with increasing taxonomic sampling. However, sampling error can be estimated in simulations, by generating multiple independent data sets on the same tree with the same number of characters and evaluating the similarity in MPT topology among replicate data sets. Note, however, that both kinds of random error are properly referenced to MPT∞, and not to the true (or model) tree. When Hillis (1996) suggested that parsimony is a highly accurate estimator of phylogeny for dense taxon sampling, he meant that the systematic component of error becomes small, so that MPT∞ becomes more similar to the true tree as the density of taxonomic sampling is increased. Thus, only by determining (or, at least, estimating) MPT∞ can we adequately test Hillis's (1996) hypothesis.
The purpose of this study is to further explore the relationship between the number of taxa and the accuracy of parsimony inference. I use simulations, based on several different model trees. The model trees are themselves generated by a simulated birth-death-sampling process, rather than by analysis of real data. Using simulated trees carries the risk that the models may not have a realistic shape, but it eliminates the risk that a real tree may have been inferred incorrectly (either in terms of branching order or branch lengths) and, thus, represent a tree shape that is inherently "easy" for parsimony analysis. I try, to the degree that it is practical, to separate the three sources of phylogenetic error. In particular, I attempt to isolate the systematic component of error under parsimony by using only data sets containing a very large number of characters, in order to estimate MPT∞. To allow direct comparisons between this and previous studies, I adhere to Hillis's definition of accuracy ("percent tree correct"; p) and I restrict analysis to cases in which all trees (including subsets) are of the same substitution depth or "diameter" (Zwickl and Hillis, 2002).
| Methods |
|---|
Tree Generation
A total of 8 computer-generated, rooted trees, each with 100 taxa, are examined here in detail. All trees were generated using a birth-death model of speciation (Nee et al., 1994; Yang and Rannala, 1997). To provide at least a minimal level of diversity among trees, the trees were generated using 2 different parameter sets. Four trees were generated using the evolver module of PAML (version 3.14a; Yang, 1997) and 4 were generated using Phyl-O-Gen (Rambaut, 2002). For the PAML trees, parameter values were: birth rate = 0.4, death rate = 0.2, substitution depth = 0.2 (called "tree height" in PAML, measured in units of substitutions per site in one lineage from the root to the tip, although the absolute values of the branch lengths were overridden when sequence evolution was simulated; see below), and random taxonomic sampling. Taxonomic sampling was based on a complete clade of 2500 extant species by setting the species sampling parameter ρ = 0.04. PAML produces only ultrametric trees. These were altered to include rate variation among taxa by multiplying each branch length by a normally distributed random number with mean = 1 and variance = 0.3. To avoid hard polytomies, a minimum branch length of 0.0001 was imposed for these trees. This parameter set produced trees with a mixture of long and short internal branches. Another 4 trees were generated using Phyl-O-Gen. A complete clade of 2500 extant taxa was generated using two distinct phases: an initial rapid growth phase with a birth rate/death rate ratio = 5:1, which was maintained until a total of 2000 extant taxa were present; then a slow-growth phase with a birth rate/death rate ratio = 1.2:1, which was maintained until a total of 2500 extant taxa was reached. This parameter set produced trees with generally short internal branches and long terminal branches. The unmodified ultrametric branch lengths were used for the Phyl-O-Gen trees. From each 2500-taxon Phyl-O-Gen tree, a single random sample of 100 taxa was generated and used as the starting tree for the current study. A sample tree from each of the 2 sources is shown in Figure 1. The two different tree-generation schemes produce trees with different shapes, in addition to the obvious difference of a clock in the Phyl-O-Gen trees and no clock in the PAML trees. In the Phyl-O-Gen trees, most of the bifurcation events are located in the half of the tree closest to the root, whereas in the PAML trees the bifurcations are spread more evenly across the whole time range of the tree.
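The rate-variation step applied to the PAML trees can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `perturb_branch_lengths` is hypothetical, and applying the minimum as a simple floor (rather than, say, redrawing the multiplier) is an assumption, since the text states only that a minimum branch length was imposed.

```python
import math
import random

def perturb_branch_lengths(lengths, variance=0.3, min_length=0.0001, rng=None):
    """Multiply each branch length by a Normal(mean=1, variance) draw.

    Assumption: the minimum branch length is applied as a floor after
    the multiplication, which also catches negative multipliers.
    """
    rng = rng or random.Random()
    sd = math.sqrt(variance)  # random.gauss takes a standard deviation
    return [max(length * rng.gauss(1.0, sd), min_length) for length in lengths]

# Example: three ultrametric branch lengths, perturbed reproducibly.
print(perturb_branch_lengths([0.01, 0.02, 0.05], rng=random.Random(1)))
```

With variance 0.3 the standard deviation is about 0.55, so a small fraction of multipliers will be negative; the floor is what keeps every branch positive in this sketch.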
Generation of stratified subsamples
Subsamples of 25 and 50 taxa were generated from each of the eight 100-taxon trees, following a scheme that expands on the one used by Rosenberg and Kumar (2003). This scheme was designed to mimic the following realistic situation. A large monophyletic clade (with a total of 2500 extant taxa) is first examined by obtaining a sample of 25 taxa that span the full diversity of the clade. After analysis of those initial 25 taxa, an additional 25 taxa are randomly sampled. After analysis of the 50-taxon sample, another 50 taxa are sampled, resulting in a final study that includes 100 taxa. This scheme, with replication, was implemented in the following way. For each of the eight 100-taxon trees, 3 independent subsets of 25 taxa were randomly chosen. Each subset was checked to ensure that it included lineages on both sides of the most basal bipartition on the full 100-taxon tree (this represents an attempt to sample the full diversity of the large clade, and also ensures that all subsample analyses include problems with equivalent diameter). For each 25-taxon subset, 4 independent sets of 25 additional taxa were chosen at random from the pool of 75 "unselected" taxa. The original 25 plus each set of 25 newly selected taxa comprised the four 50-taxon subsamples. This sampling scheme thus created a total of three 25-taxon subsamples and twelve 50-taxon subsamples from each 100-taxon model tree.
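The sampling design described above can be sketched as follows. The function name, its arguments, and the use of redrawing when a subset fails the basal-bipartition check are assumptions made for illustration; the text says only that each subset was checked.

```python
import random

def stratified_subsamples(left_clade, right_clade, n_base=3, n_ext=4,
                          size=25, rng=None):
    """Return a list of (base_set, [extended_set, ...]) taxon samples.

    left_clade / right_clade: taxa on either side of the most basal
    bipartition of the full model tree.
    """
    rng = rng or random.Random()
    all_taxa = list(left_clade) + list(right_clade)
    left, right = set(left_clade), set(right_clade)
    designs = []
    for _ in range(n_base):
        # Assumption: redraw until both sides of the basal split are sampled.
        while True:
            base = set(rng.sample(all_taxa, size))
            if base & left and base & right:
                break
        pool = [t for t in all_taxa if t not in base]  # the "unselected" taxa
        extended = [base | set(rng.sample(pool, size)) for _ in range(n_ext)]
        designs.append((base, extended))
    return designs

taxa = [f"t{i}" for i in range(100)]
designs = stratified_subsamples(taxa[:40], taxa[40:], rng=random.Random(0))
print(len(designs), sum(len(ext) for _, ext in designs))
```

For one 100-taxon tree this yields three 25-taxon subsets and twelve 50-taxon subsets, each 50-taxon subset containing its parent 25-taxon subset.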
Sequence Generation
DNA sequences were simulated along each model tree using the evolver module of PAML (version 3.14a; Yang, 1997). Following the recommendation of Zwickl and Hillis (2002), a moderately complex model of evolution was used for simulated sequence evolution. All data sets were generated using an HKY+Γ model (Hasegawa et al., 1985; Yang, 1993). Parameter values for the HKY+Γ model were transition/transversion parameter κ = 8, gamma shape parameter α = 0.4, and nucleotide frequencies = 0.1, 0.2, 0.3, and 0.4 (for T, A, C, and G, respectively). All data sets were simulated on the full 100-taxon trees. Analyses of taxonomic subsets used the same data generated for the full 100 taxa by using the "delete taxa" and "restore taxa" commands in PAUP. All trees and subsets were examined under 2 different levels of overall substitution rate. This was accomplished by using the branch length multiplier parameter "m" in the evolver control file, which scales the tree so that the sum of all branch lengths equals m (this also scaled the Phyl-O-Gen trees to be the same substitution depth as the PAML trees). In the following analyses, "lower rate" simulations used m = 4, whereas "higher rate" simulations used m = 20. The PAML trees originally had branch lengths that summed to about 10, so m = 4 is equivalent to an average root-to-tip distance of about 0.08 substitutions per site, whereas m = 20 is equivalent to an average root-to-tip distance of about 0.4 substitutions per site. The Murphy et al. (2001) 66-taxon mammal data set produces a tree with an average root-to-tip distance of about 0.2 substitutions per site, so the values used here fall within a range of substitution rates that might be encountered with real data.
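The arithmetic behind the two rate levels can be checked with a short sketch. The function names are hypothetical and this is not evolver's own code; it only assumes, as stated above, that scaling makes the branch lengths sum to m, and it uses the approximate figures from the text (substitution depth 0.2, original branch-length sum of about 10).

```python
def scale_to_total(branch_lengths, m):
    """Rescale branch lengths proportionally so that they sum to m."""
    total = sum(branch_lengths)
    return [b * m / total for b in branch_lengths]

def scaled_depth(depth, original_total, m):
    """Root-to-tip depth after a tree with the given total length is scaled to m."""
    return depth * m / original_total

# Approximate values from the text: depth 0.2, branch lengths summing to ~10.
for m in (4, 20):
    print(m, round(scaled_depth(0.2, 10, m), 3))
```

Because every branch is multiplied by the same factor m / total, the relative shape of the tree is unchanged and only the overall substitution depth moves.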
Analyses
All analyses were conducted using parsimony, using the computer program PAUP* 4.0b10 (Swofford, 2000). Each search began from a starting tree obtained by parsimony-driven sequential addition with taxa added in a randomly determined order, followed by TBR branch swapping (see Swofford et al., 1996, for a description of these methods). Multiple searches were conducted for each data set, using different taxon-addition orders. Preliminary studies using 20 to 100 replicate searches indicated that, with 10⁶ characters, 5 replicate searches were generally sufficient to find the MPT. In a few cases where 5 searches found 4 or 5 different trees, new data were generated and those data were subjected to 20 replicates for each search. No limit was placed on the number of trees saved at any step (but in no instance was more than a single tree held at the end of any search). Error (p) is measured as the Robinson-Foulds distance (Robinson and Foulds, 1981; Penny and Hendy, 1985) between the MPT and the reference tree, divided by twice the number of internal branches. This is the same measure of p used by Hillis (1996), Rosenberg and Kumar (2001), and Zwickl and Hillis (2002).
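The error measure can be illustrated with a small, self-contained sketch. It is not PAUP*'s implementation: trees are represented here as nested Python tuples with string leaf labels, the function names are hypothetical, and the denominator assumes a fully resolved reference tree, which has n − 3 internal branches when treated as unrooted.

```python
def leaf_set(tree):
    """Leaves under a nested-tuple tree; any non-tuple value is a leaf label."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions, each stored as the side lacking a reference taxon."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # fixed reference taxon used to normalize each split
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if ref in below else below
        if 2 <= len(side) <= len(taxa) - 2:  # skip trivial splits and the root
            found.add(side)
        return below

    visit(tree)
    return found

def rf_error(estimate, reference):
    """Robinson-Foulds distance divided by twice the number of internal branches."""
    rf = len(splits(estimate) ^ splits(reference))
    internal = len(leaf_set(reference)) - 3  # assumes a fully resolved reference
    return rf / (2 * internal)

reference = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))
print(rf_error(estimate, reference))  # 0.5: one of two internal branches is wrong
```

Dividing by twice the number of internal branches bounds the value between 0 and 1 for two fully resolved trees on the same taxa, because each wrong branch in the estimate is matched by a missing branch from the reference.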
The MPT from each of the large data sets is intended to approximate the tree that would be most parsimonious if the data set were infinitely large (MPT∞). However, even with 10⁶ characters, different data sets may have different MPTs. The following procedure was used to estimate MPT∞ for each model tree. A minimum of 5 independent, very large data sets was simulated on each 100-taxon model tree. Initially, data sets were generated with 10⁶ characters per taxon. If at least 4 of the 5 independent data sets gave the same MPT (for either the full 100-taxon set or for any given subset), that tree was deemed to be the estimated MPT∞. If the 5 data sets did not agree, then at least 5 larger data sets were generated and the procedure was repeated. Computer memory constraints limited maximum data set size to 2 × 10⁶ characters. When the 2 × 10⁶-character data sets did not agree, but there was a clear majority, the most common tree among the replicates was chosen as the estimated MPT∞. When there was no clear majority, the tree with the smallest error (relative to the true tree) was used as the estimated MPT∞.
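The decision rule in the preceding paragraph can be written as a small function. This is a sketch under assumptions, not the author's script: the function and argument names are hypothetical, topologies are represented by any hashable value, and "clear majority" is interpreted here as more than half of the replicates, which the text does not define.

```python
from collections import Counter

def choose_reference_tree(trees, errors, at_max_size=False, agree=4):
    """Pick the estimated infinite-data MPT from replicate data sets.

    trees:  one hashable topology per replicate data set
    errors: error of each replicate's MPT relative to the model tree
    Returns the chosen topology, or None if larger data sets are needed.
    """
    ranked = Counter(trees).most_common()
    top_tree, top_count = ranked[0]
    if top_count >= agree:            # e.g. at least 4 of 5 replicates agree
        return top_tree
    if not at_max_size:
        return None                   # simulate larger data sets and retry
    if 2 * top_count > len(trees):    # assumption: "clear majority" = more than half
        return top_tree
    best = min(range(len(trees)), key=lambda i: errors[i])
    return trees[best]                # fall back to the smallest-error tree

print(choose_reference_tree(["a", "a", "a", "b", "c"], [0.1] * 5))
print(choose_reference_tree(["a", "a", "a", "b", "c"], [0.1] * 5, at_max_size=True))
```

Note that the final fallback uses the model tree to choose among candidates, so in those cases the estimate is biased toward lower apparent systematic error.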
| Results |
|---|
Success of Heuristic Searches with Very Large Data Sets
Using large data sets to estimate MPT∞ requires first that the MPT can be found for each data set. This cannot be proven, of course, with so many taxa. In order to assess the success at finding the MPT with the basic heuristic search strategy of parsimony-driven sequential addition plus TBR branch swapping, every search was repeated a minimum of 5 times with different, randomly chosen taxon addition orders. The search strategy appears to be reasonably effective at identifying the MPT when used with these very large data sets. Across all the searches conducted on data generated under the lower substitution rate (including all 100-taxon models, all subsets examined under each model, and all 5 replicate data sets; over 600 distinct analyses), a single, identical MPT was found in all 5 replicate searches > 80% of the time. A single, identical MPT was found in at least 4 of the 5 replicate searches > 95% of the time. For the data sets generated under the higher substitution rate, the heuristic searches were less successful: a single, identical MPT was found in at least 4 of the 5 searches only 56% of the time. There was considerable variability in search success across different model trees at the higher substitution rate. At the extremes, one model tree gave an identical MPT in at least 4 of 5 replicate searches 75% of the time, whereas another model tree gave an identical MPT in at least 4 of 5 replicate searches only 19% of the time. In order to determine how much impact tree search error may have on the subsequent analyses, the model tree with the lowest frequency of finding a single, identical MPT across replicate searches was reanalyzed using 20 replicates for every search. In that case, the tree that was ultimately most parsimonious out of the 20 replicates was found within the first 5 replicate searches 93% of the time, so it appears that the strategy of using only 5 replicate searches did not significantly alter the ability to find the MPT for these data sets and subsets.
For this same model tree, all the results described below were compared between the analyses using 5 replicate searches and those using 20 replicates. Only very minor differences were observed in any of the measures described below.
Estimating MPT∞
Adequate estimation of MPT∞ requires elimination (or minimization) of random sampling error, in addition to minimizing tree search error. Data sets must be large enough so as to be functionally equivalent to an infinitely large data set. Using 10⁶ simulated characters did not fully achieve the objective of eliminating random sampling error under parsimony, but the remaining sampling error was small relative to the systematic component of phylogenetic error. In other words, the trees found using multiple, independent data sets were more similar to each other than they all were to the true (model) tree. The magnitude of the sampling error generally varied with the inverse of the number of taxa. For the 25-taxon subsets, replicate data sets almost always produced identical MPTs. For the full 100-taxon data sets, there were a total of 16 different analyses (8 trees, with 2 different substitution rates for each tree). The identical MPT was found for each of the 5 replicates in only 5 of those 16 cases. However, across all 16 analyses the average RF distance between replicate MPTs was only 2.6, whereas the average RF distance between the MPT and the model (true) tree was 16.9. Adequate estimation of MPT∞ could also be hindered if the model trees include many branches on which the expected number of total substitutions is close to 0 (Kumar and Gadagkar, 2000). Even at the lower substitution rate, however, branches with expected length near 0 are not likely to be a problem in the present study. The PAML trees were generated with a minimum branch length of 4 × 10⁻⁶ at the lower substitution rate. With 10⁶ characters, any branch that short would be expected to have an average of 4 changes per data set. The Phyl-O-Gen trees were not generated with any specific restriction, but the minimum branch length across all branches on all 4 trees used here was slightly over 2 × 10⁻⁶, and only 3 branches across all 4 Phyl-O-Gen trees had a length of less than 4 × 10⁻⁶ at the lower substitution rate.
The Relationship Between Accuracy and Taxonomic Sampling
Across all model trees and both substitution rates, average error was 10.7% for the 25-taxon subsamples, 10.4% for the 50-taxon subsamples, and 8.9% for the full 100-taxon trees. Using the error level for the 25-taxon subsamples as a reference point, increasing sampling from 25 to 50 taxa reduced the average error by about 3% and increasing sampling from 25 to 100 taxa reduced the average error by about 17%. Despite the reductions in average error as taxon sampling is increased, the level of error actually increased in a majority (106 of 192) of the individual replicates when sampling was increased from 25 to 50 taxa. The level of error also increased in exactly half (24 of 48) of the individual replicates when sampling was increased from 25 to 100 taxa.
The overall level of error across the full tree may not be the most appropriate measure on which to base a decision regarding adding taxa to a phylogenetic study, depending on the question of greatest interest. The stratified sampling scheme adopted here allows us to address a question that may be quite common: if we are concerned only with estimating the original 25-taxon phylogeny, does increasing taxon sampling to 50 or 100 taxa improve the accuracy with which those original 25 taxa are inferred? We can examine this question because each of the 25-taxon subsets is fully included in 4 of the 50-taxon subsets (and, of course, in 1 of the 100-taxon trees). This is similar to the approach taken by Pollock et al. (2002) and Rosenberg and Kumar (2003). Pollock et al. (2002) suggested measuring the effect of increased taxon sampling as ΔE = (Es − Ep)/Es, where Es is the error in the smaller taxonomic sample (the subsample analysis) and Ep is the error obtained when an analysis using the larger sample is pruned to include only the taxa in the smaller subsample (the pruned analysis). ΔE provides an intuitive measure of the effect of increased taxonomic sampling: the fraction of error that is removed by increased taxonomic sampling. But the intuitive nature of ΔE holds only so long as Es > Ep. If Es can be less than Ep, then ΔE is not symmetrical. It is bounded at +1, but unbounded below 0 and undefined if Es = 0. For example, if Es = 0.08 and Ep = 0.02, then ΔE = 0.75. But the same magnitude of change in the opposite direction gives a very different result: if Es = 0.02 and Ep = 0.08, then ΔE = −3. The asymmetry in ΔE caused no problems for Pollock et al. (2002), because, under their simulation conditions, the error always either stayed constant or decreased with increased taxonomic sampling. In the present study, however, error frequently increases with increased taxonomic sampling (Table 1), particularly when the 25-taxon subset analyses are compared to the 50-taxon pruned analyses. Indeed, ΔE is undefined for 32 of the 192 individual comparisons. Both Es = 0 and Ep = 0 in 13 of those 32 replicates. These could reasonably be assigned a ΔE value of 0. But in the remaining 19 cases, the 25-taxon subtree is inferred with perfect accuracy, whereas errors affecting those 25 taxa are introduced when sampling is increased to 50 taxa, so ΔE is truly undefined in those cases.
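The asymmetry of the Pollock et al. (2002) measure is easy to see numerically. The sketch below uses a hypothetical function name and returns `None` for the undefined case; that return convention is an illustrative choice, not something specified in the text.

```python
def delta_e(es, ep):
    """Fraction of subsample error removed by the larger, pruned analysis.

    es: error of the smaller-sample analysis (Es)
    ep: error of the pruned larger-sample analysis (Ep)
    Returns None when Es is 0, where the ratio is undefined.
    """
    if es == 0:
        return None
    return (es - ep) / es

print(round(delta_e(0.08, 0.02), 6))  # error falls from 0.08 to 0.02
print(round(delta_e(0.02, 0.08), 6))  # same magnitude of change, opposite direction
print(delta_e(0.0, 0.04))             # perfect subsample, errors introduced later
```

The first two calls reproduce the worked values given above (0.75 and −3), and the third shows the case in which error is introduced into a perfectly inferred subtree and no value can be assigned.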
If ΔE values cannot be used directly, then how do we quantify the impact of increased taxonomic sampling on accuracy? In the following results, I report a simple count of replicates in which error is increased or decreased by increased taxonomic sampling, regardless of the magnitude of the change. Then, to incorporate a measure of the magnitude of changes in error as taxonomic sampling is increased, I report ΔE values that are based on averaging the error across replicates before ΔE is calculated. Overall, increasing taxon sampling from 25 to 50 taxa increased accuracy for the 25-taxon tree in 41% of the replicates and decreased accuracy in 30% of the replicates (Table 1). There is, however, a noticeable effect of tree shape and substitution rate. For only the trees generated using PAML (both substitution rates combined), accuracy for the 25-taxon tree decreased more often than increased when taxon sampling was increased from 25 to 50 taxa. For only the data sets generated under the lower substitution rate (across all 8 trees combined), accuracy also decreased in a plurality of replicates when taxon sampling was increased from 25 to 50. This pattern is amplified when the results are partitioned according to both tree shape and substitution rate. At one extreme (Phyl-O-Gen trees at the higher substitution rate), an increase from 25 to 50 taxa improved accuracy in 48% and degraded accuracy in only 21% of the replicates. At the other extreme (PAML trees at the lower substitution rate), increasing sampling from 25 to 50 taxa improved accuracy in only 17% and degraded accuracy in 50% of the replicates. These results are not likely to be an artifact of using trees generated in two different ways. Across all runs, identical settings of the m parameter resulted in very similar numbers of constant, variable, and parsimony-informative sites for both the PAML trees and the Phyl-O-Gen trees.
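The two summaries used here, a direction-only count and the error-reduction measure of Pollock et al. (2002) (ΔE) computed from errors averaged across replicates, can be sketched as below. The function name, the dictionary keys, and the example values are hypothetical and chosen only for illustration; they are not data from this study.

```python
def summarize(pairs):
    """Summarize (Es, Ep) error pairs, one pair per replicate comparison.

    Counts the direction of change per replicate, then computes the
    error-reduction ratio from the replicate-averaged errors.
    """
    improved = sum(1 for es, ep in pairs if ep < es)
    degraded = sum(1 for es, ep in pairs if ep > es)
    unchanged = len(pairs) - improved - degraded
    mean_es = sum(es for es, _ in pairs) / len(pairs)
    mean_ep = sum(ep for _, ep in pairs) / len(pairs)
    pooled = (mean_es - mean_ep) / mean_es if mean_es > 0 else None
    return {"improved": improved, "degraded": degraded,
            "unchanged": unchanged, "pooled_delta_e": pooled}

# Hypothetical replicates, including two with Es = 0.
example = [(0.0, 0.04), (0.08, 0.0), (0.04, 0.04), (0.0, 0.0)]
print(summarize(example))
```

Averaging first means that replicates with Es = 0 still contribute to the result, which is what makes the pooled value usable when the per-replicate ratio is undefined.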
Increasing sampling from 25 to 100 taxa improved accuracy more often, compared to the 50-taxon analyses (Table 2). Across all conditions combined, increasing sampling from 25 to 100 taxa increased accuracy in 58% of the replicates, compared to only 19% of the replicates in which accuracy decreased. As with the pruned 50-taxon results, increasing sampling from 25 to 100 taxa improved accuracy most often for Phyl-O-Gen trees and the higher substitution rate, and least often for PAML trees and the lower substitution rate.
To assess the average magnitude of ΔE across all the simulations performed here, the RF error values were averaged across replicates prior to calculation of ΔE. Calculated in this way, increasing taxon sampling from 25 to 50 taxa removed approximately 17% of the error in the 25-taxon data set, across all replicates combined (Table 1). Increasing taxon sampling from 25 to 100 taxa removed 49% of the 25-taxon error, across all replicates combined (Table 2). The average error removed showed variation between the 2 tree shapes and between the 2 substitution rates similar to that seen when the results are tabulated as counts. Of specific interest are the results for data generated at the lower substitution rate on the PAML trees. Increasing taxon sampling from 25 to 50 taxa increased, rather than decreased, average error by about 11% for the 48 replicates that share that combination (Table 1). However, increasing sampling from 25 to 100 taxa for the same combination (PAML trees at the lower substitution rate) did decrease the average error, by about 19% (Table 2).
| Discussion |
|---|
The question "should we add more taxa or more characters?" has been asked many times in recent years. The present simulations differ from other recent analyses in that I address only the "more taxa" part of that question. The current study focuses only on the systematic component of error in parsimony analysis, so the results herein may tell us more about the underlying behavior of the parsimony method than they do about making a specific choice between more characters or more taxa in any particular real situation. The results generally support the conclusions of Zwickl and Hillis (2002) and Pollock et al. (2002) that the accuracy of parsimony analysis improves with increased taxonomic sampling. Both the average overall accuracy and the average accuracy for a specific taxonomic subset increased with increased taxon sampling. Unlike those two prior studies, however, these simulations show variation among trees, and include a nontrivial fraction of cases in which accuracy decreased with increased taxon sampling.
The results pertaining to accuracy across the entire tree as a function of taxon sampling may be compared most directly with results presented by Rosenberg and Kumar (2001) and Zwickl and Hillis (2002). In the present study, overall accuracy increased as sampling was increased from 25 to 50 taxa, and increased again as sampling was increased from 50 to 100 taxa. However, the magnitude of this increase is much smaller than that reported by Zwickl and Hillis (2002). In Zwickl and Hillis' (2002) results, increasing sampling from 25 to 50 taxa improved overall accuracy by approximately 30% (based on their Fig. 5b), compared to only about a 3% improvement in both the present study and Rosenberg and Kumar (2001). In Zwickl and Hillis' (2002) study, increasing sampling from 25 to 66 taxa (the maximum available in their design) improved accuracy by approximately 50%, compared to only a 17% improvement for a much larger increase in taxon sampling (25 to 100 taxa) in the present study. Indeed, extrapolating the straight-line relationship presented in Zwickl and Hillis' (2002) Fig. 5b to 100 taxa suggests an expectation of near perfect accuracy at 100 taxa under their simulation conditions. For the conditions used here, the 100-taxon trees were inferred with an average error of nearly 9%. Clearly, if parsimony does become a fully consistent estimator of phylogeny with dense taxon sampling, it will only happen with more taxa than were examined here. Zwickl and Hillis (2002) did not provide details regarding among-replicate variability, so there is no way to determine if any of their 25-taxon subsamples had higher accuracy than some of their 50-taxon subsamples, or higher accuracy than the full 66-taxon tree. In the present study, overall accuracy decreased, rather than increased, in a majority of replicates when sampling was increased from 25 to 50 taxa, and decreased in half of the replicates when sampling was increased from 25 to 100 taxa.
Overall accuracy will often not be the most appropriate measure of success when deciding if it is worth adding taxa to an existing study. Suppose, for example, that a study with 25 taxa includes representatives of all or most of the recognized major lineages in a group. The goal of adding taxa may be to improve the accuracy of estimation among those basal splits. Adding more taxa may not be deemed worthwhile if it retains all the errors initially made in the 25-taxon analysis, even if the overall accuracy is improved. Conversely, adding taxa may be well worth the effort if errors made at the 25-taxon level are likely to be corrected, even if other errors may be introduced in the new parts of the tree.
The results pertaining to accuracy of a subset analysis compared to accuracy of a larger, pruned analysis may be compared to the results obtained by Pollock et al. (2002) and Rosenberg and Kumar (2003). Using a subset of the original simulations from Rosenberg and Kumar (2001), Pollock et al. (2002) examined the effect of increased taxonomic sampling on the accuracy with which a smaller taxonomic subset is inferred. They observed a simple pattern: pruning the full 66-taxon tree always gave at least the same accuracy (and nearly always gave higher accuracy) for a randomly chosen subset of taxa, compared to the analyses that utilized only the data for the smaller taxonomic sample. Rosenberg and Kumar (2003; fig. 1) also obtained increased accuracy with increased taxon sampling in nearly every case. In contrast, in the present study there were many instances in which accuracy decreased when more taxa were included. Indeed, for 3 individual model trees at the lower substitution rate, accuracy for a 25-taxon subset was never observed to improve with increased taxon sampling. On average, however, the present results support the conclusion that increasing taxon sampling does increase the accuracy of parsimony analysis (so long as the individual RF error values are averaged across all replicates before E is calculated). Further study will be needed to determine if this trend continues for even larger increases in taxonomic sampling. More importantly, the current results show that the shape of the tree might have a considerable effect on the advantage (or disadvantage) of increasing taxonomic sampling. Any generalities should, therefore, await study of a larger range of tree shapes.
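The pruned comparison discussed above can be illustrated with a minimal sketch. It assumes trees are written as nested Python tuples of taxon names and measures error with the Robinson-Foulds (RF) symmetric difference of Robinson and Foulds (1981). The function names, the toy 8-taxon trees, and the normalization by 2(n - 3) are illustrative assumptions, not the code or the exact error statistic used in this study.

```python
def leaves(tree):
    """Return the set of taxon names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Restrict a tree to the taxa in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (prune(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def splits(tree):
    """Return the nontrivial bipartitions of a tree, treated as unrooted."""
    taxa = leaves(tree)
    ref = min(taxa)  # canonical form: store the side that lacks `ref`
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = below if ref not in below else taxa - below
        if 2 <= len(side) <= len(taxa) - 2:  # skip trivial splits
            found.add(side)
        return below

    walk(tree)
    return found


def rf_distance(a, b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    return len(splits(a) ^ splits(b))


def normalized_rf(a, b):
    """RF scaled by its maximum for binary trees, 2(n - 3); an assumption."""
    n = len(leaves(a))
    return rf_distance(a, b) / (2 * (n - 3))


# Hypothetical model tree and an estimate with errors among A, B, C, D.
true_tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
estimate = ((("A", "C"), ("B", "D")), (("E", "F"), ("G", "H")))
subset = {"A", "B", "E", "G"}

print(rf_distance(true_tree, estimate))                                # 4
print(rf_distance(prune(true_tree, subset), prune(estimate, subset)))  # 0
```

In this toy case the full estimate contains errors, yet the relationships among the four retained taxa are recovered correctly once both trees are pruned. This is the sense in which overall accuracy and accuracy for a fixed subset can give different answers.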
The suggestion that the impact of increased taxon sampling may depend on the shape of the tree raises the question: what sort of model trees should we use in simulation studies of accuracy? The present study used artificial trees, whereas previous studies used trees that had been inferred from real data (Hillis, 1996; Rosenberg and Kumar, 2001; Pollock et al., 2002). It is possible to criticize use of model trees from either source. Artificial trees may be generated in shapes that are never found in nature, whereas trees inferred from real data may not accurately reflect the true shape of the tree that generated the data. If the method used to infer the real phylogeny is biased in favor of making certain types of error (either in terms of branching order or branch lengths), then the resulting model tree may be particularly easy to reconstruct. Using a method to infer the original tree that is different from the method(s) used to test the impact of taxon sampling may eliminate this potential flaw, depending on whether or not the methods in question share similar biases.
It may be questioned whether 10^6 characters represents a sufficiently large sample to justify the claim that the systematic component of phylogenetic error has been isolated. In the current study, it is indeed likely that the estimated MPT is not the true MPT in some cases, because the number of replicate data sets was very small (this is particularly a problem in those instances in which not all replicate data sets gave the same topology). Nevertheless, Kim's (1996) assessment that analysis of consistency by simulation would be virtually impossible is almost certainly overly pessimistic. Kim asserted that random error will be eliminated only when enough characters are generated so that every possible site pattern is sampled with a certain minimum frequency. By that argument, however, the "easiest" parts of the tree (clades that have a long common ancestor branch) will be the ones that cause the most difficulty, because site patterns that contradict such a clade will be extremely unlikely. In fact, random sampling error is eliminated for those easy parts of the tree even with much smaller numbers of characters. It is the clades with a short common ancestor branch for which random error is not completely eliminated until much larger numbers of characters are generated. In the present case, at least, generating and analyzing multiple independent data sets demonstrated that the remaining random error is quite small relative to the systematic error. Another worry is that the tree that is most parsimonious for 10^6 characters may simply be different from the tree that is most parsimonious for a truly infinite number of characters. For a much smaller problem (5 taxa), Holland et al. (2003) recently discovered that the most common MPT for a small number of characters can be different from the tree that would be most parsimonious if an infinite supply of characters were available. For their 5-taxon problem, this tendency to be misleading occurred only in a very narrow region of the branch-length parameter space, and disappeared when the number of characters was > 5000. It is not clear, however, if such an effect might apply to 25-, 50-, or 100-taxon trees and, if so, if the number of characters required to overcome the effect might be much larger in larger trees. In any case, it seems likely that any such errors would be small compared to the observed differences between the estimated MPTs and the true trees.
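The scale of the site-pattern criterion attributed to Kim above can be made concrete with a short calculation. With 4 nucleotide states, n taxa admit 4^n distinct site patterns. The sketch below compares that count with a sample of one million characters, a figure assumed here purely for illustration; the function name is hypothetical.

```python
def site_patterns(n_taxa, n_states=4):
    """Number of distinct site patterns for n_taxa aligned sequences."""
    return n_states ** n_taxa


n_chars = 10 ** 6  # assumed number of simulated characters (illustrative)

for n in (5, 25, 50, 100):
    patterns = site_patterns(n)
    # Upper bound on the fraction of patterns that could appear even once.
    coverage = min(1.0, n_chars / patterns)
    print(f"{n:>3} taxa: {patterns:.3e} patterns, coverage <= {coverage:.3e}")
```

Under this assumption, at most about one pattern in a billion could be observed even once for 25 taxa. The criterion therefore cannot be met in practice, and the relevant question becomes whether the splits that matter are sampled adequately, as argued above.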
By design, this study holds the number of characters constant, and so does not speak directly to the core issue of the recent debate (Rosenberg and Kumar, 2001, 2003; Zwickl and Hillis, 2002; Pollock et al., 2002; Hillis et al., 2003) concerning the relative merit of increasing the number of characters versus increasing the number of taxa. As discussed by Hillis et al. (2003), the effect of increasing the number of characters will vary, depending on the number of characters already examined. If the initial number of characters is small, then an increase in character sampling will go a long way toward achieving convergence. So long as the tree being converged on is the true tree, or is very similar to the true tree, then adding characters should produce improvement in accuracy in those cases. If the number of characters is already large and if convergence is already nearly achieved, then adding more characters will make little or no difference in the inferred phylogeny.
The conclusions presented here apply only to parsimony. When accuracy does decrease with added taxon sampling, that is a consequence of the particular requirements on tree shape that are imposed by parsimony. For other methods, the relationship between branch lengths and consistency may be different. If a substantially correct model is used, then either the minimum-evolution method or maximum likelihood (ML) should be consistent regardless of the number of taxa or pattern of branch lengths. However, it has been shown that use of an incorrect model will result in inconsistency problems for neighbor joining (NJ) that can be quite similar to those under parsimony (DeBry, 1992), and that use of an incorrect model can result in an increase in phylogenetic error using NJ as more and more taxa are examined (Strimmer and von Haeseler, 1996). Even so, other considerations may argue in favor of increasing taxon sampling under other methods, particularly ML. For ML, including more taxa improves estimation of model parameters (Pollock and Bruno, 2000), and, thus, improves the chance that ML will be a consistent estimator of the phylogeny. Finally, both the present and other recent studies (Zwickl and Hillis, 2002; Rosenberg and Kumar, 2003) examine only the effects of adding randomly chosen taxa on accuracy of an initial phylogenetic investigation. It is possible that larger improvements in the accuracy for a small subset of taxa can be achieved by targeted addition of taxa (Goldman, 1998; Hillis, 1998).
| Acknowledgements |
|---|
I thank H. Kishino, S. Kumar, D. Pollock, J. Thorne, and one anonymous reviewer for insightful comments that substantially improved this study. This material is based upon work supported by the National Science Foundation under Grant No. 0075306.
| References |
|---|
DeBry R. W. The consistency of several phylogeny-inference methods under varying evolutionary rates. Mol. Biol. Evol. (1992) 9:537–551.
Goldman N. Phylogenetic information and experimental design in molecular systematics. Proc. R. Soc. Lond. B Biol. Sci. (1998) 265:1779–1786.
Graham R. L., Foulds L. R. Unlikelihood that minimal phylogenies for a realistic biological study can be constructed in reasonable computational time. Math. Biosci. (1982) 60:133–142.
Graybeal A. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. (1998) 47:9–17.
Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.
Hillis D. M. Inferring complex phylogenies. Nature (1996) 383:130–131.
Hillis D. M. Are big trees indeed easy? Reply. Trends Ecol. Evol. (1997) 12:358.
Hillis D. M. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. (1998) 47:3–8.
Hillis D. M., Huelsenbeck J. P., Swofford D. L. Hobgoblin of phylogenetics? Nature (1994) 369:363–364.
Hillis D. M., Pollock D. D., McGuire J. A., Zwickl D. J. Is sparse taxon sampling a problem for phylogenetic inference? Syst. Biol. (2003) 52:124–126.
Holland B. R., Penny D., Hendy M. D. Outgroup misplacement and phylogenetic inaccuracy under a molecular clock—a simulation study. Syst. Biol. (2003) 52:229–238.
Kim J. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. (1996) 45:363–374.
Kim J. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. (1998) 47:43–60.
Kumar S., Gadagkar S. R. Efficiency of the neighbor-joining method in reconstructing deep and shallow evolutionary relationships in large phylogenies. J. Mol. Evol. (2000) 51:544–553.
Murphy W. J., Eizirik E., Johnson W. E., Zhang Y. P., Ryder O. A., O'Brien S. J. Molecular phylogenetics and the origins of placental mammals. Nature (2001) 409:614–618.
Nee S., May R. M., Harvey P. H. The reconstructed evolutionary process. Phil. Trans. R. Soc. Lond. B Biol. Sci. (1994) 344:305–311.
Penny D., Hendy M. D. The use of tree comparison metrics. Syst. Zool. (1985) 34:75–82.
Pollock D. D., Bruno W. J. Assessing an unknown evolutionary process: Effect of increasing site-specific knowledge through taxon addition. Mol. Biol. Evol. (2000) 17:1854–1858.
Pollock D. D., Zwickl D. J., McGuire J. A., Hillis D. M. Increased taxon sampling is advantageous for phylogenetic inference. Syst. Biol. (2002) 51:664–671.
Purvis A., Quicke D. L. J. Are big trees indeed easy? Reply. Trends Ecol. Evol. (1997) 12:357–358.
Rambaut A. Phyl-O-Gen v1.1 (2002) Available at: http://evolve.zoo.ox.ac.uk/.
Robinson D. F., Foulds L. R. Comparison of phylogenetic trees. Math. Biosci. (1981) 53:131–147.
Rosenberg M. S., Kumar S. Incomplete taxon sampling is not a problem for phylogenetic inference. Proc. Natl. Acad. Sci. USA (2001) 98:10751–10756.
Rosenberg M. S., Kumar S. Taxon sampling, bioinformatics, and phylogenomics. Syst. Biol. (2003) 52:119–124.
Soltis D. E., Soltis P. S., Nickrent D. L., Johnson L. A., Hahn W. J., Hoot S. B., Sweere J. A., Kuzoff R. K., Kron K. A., Chase M. W., Swensen S. M., Zimmer E. A., Chaw S.-M., Gillespie L. J., Kress W. J., Sytsma K. J. Angiosperm phylogeny inferred from 18S ribosomal DNA sequences. Ann. Mo. Bot. Gard. (1997) 84:1–49.
Strimmer K., von Haeseler A. Accuracy of neighbor joining for n-taxon trees. Syst. Biol. (1996) 45:516–523.
Swofford D. PAUP*: Phylogenetic analysis using parsimony (*and other methods) (2000) Sunderland, Massachusetts: Sinauer Associates.
Swofford D. L., Olsen G. J., Waddell P. J., Hillis D. M. Phylogenetic Inference. In: Molecular systematics—Hillis D. M., Moritz C., Mable B. K., eds. (1996) Sunderland, Massachusetts: Sinauer Associates. Pages 407–514.
Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS (1997) 13:555–556.
Yang Z., Rannala B. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo Method. Mol. Biol. Evol. (1997) 14:717–724.
Yang Z. H. Maximum-likelihood-estimation of phylogeny from DNA-sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.
Yang Z. H., Goldman N. Are big trees indeed easy? Trends Ecol. Evol. (1997) 12:357.
Zwickl D. J., Hillis D. M. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. (2002) 51:588–598.

