© 2006 Society of Systematic Biologists
Detecting the Node-Density Artifact in Phylogeny Reconstruction
Edited by Thomas Buckley: Associate Editor
School of Biological Sciences, University of Reading Whiteknights, Reading RG6 6AJ, England E-mail: m.pagel{at}reading.ac.uk (M.P.)
| Abstract |
|---|
|
|
|---|
The node-density effect is an artifact of phylogeny reconstruction that can cause branch lengths to be underestimated in areas of the tree with fewer taxa. Webster, Payne, and Pagel (2003, Science 301:478) introduced a statistical procedure (the "delta" test) to detect this artifact, and here we report the results of computer simulations that examine the test's performance. In a sample of 50,000 random data sets, we find that the delta test detects the artifact in 94.4% of cases in which it is present. When the artifact is not present (n = 10,000 simulated data sets) the test showed a type I error rate of approximately 1.69%, incorrectly reporting the artifact in 169 data sets. Three measures of tree shape or "balance" failed to predict the size of the node-density effect. This may reflect the relative homogeneity of our randomly generated topologies, but emphasizes that nearly any topology can suffer from the artifact, the effect not being confined only to highly unevenly sampled or otherwise imbalanced trees. The ability to screen phylogenies for the node-density artifact is important for phylogenetic inference and for researchers using phylogenetic trees to infer evolutionary processes, including their use in molecular clock dating.
Keywords: Delta test; molecular clock; molecular evolution; node-density effect; phylogenetic reconstruction; speciation; simulation
Received September 2, 2005; Revised November 11, 2005; Accepted January 11, 2006
Fitch and Bruschi (1987) and Fitch and Beintema (1990) identified an artifact of phylogeny reconstruction that has come to be known as the node-density effect. These authors noted that branch lengths will tend to be better estimated in parts of a tree where more taxa have been sampled. Conversely, where taxon sampling is sparse or the amount of change between successive nodes of the tree is large, phylogenetic reconstruction methods will tend to underestimate the true amount of change. This is because in longer branches of a tree, multiple "hits," or two or more changes at a given site, are common. These multiple hits are mostly invisible and get reconstructed as one change, causing branch lengths to be underestimated. The effect never disappears but will be smaller in shorter branches of the tree, where fewer multiple hits are expected.
Summed over all of the branch lengths of a phylogeny, this artifact can cause an apparent relationship between the number of nodes and the total inferred amount of change. Where there has been more net speciation (more internal nodes of the tree), the true amount of change along each branch is better estimated, giving the appearance that there has been more total evolution along the summed path from the root of the phylogeny to its tips. The effect increases with the number of nodes included along a path until the total path length approaches the true length. This leads to the expectation of a curvilinear relationship between the reconstructed length of a path and the number of nodes along that path. Figure 1a shows a phylogeny in which the artifact is present, and Figure 1b plots the total root-to-tip (species) path lengths against the number of nodes along the path, showing the expected curvilinear trend (see also Fitch and Beintema's [1990] Figure 2, reprinted in Page and Holmes [1998, page 169]).
|
|
Webster et al. (2003) introduced a statistical test to detect phylogenies that suffer from the node-density artifact. Those authors fit a curve of the form n = β x
, where n is the number of nodes, x is the phylogenetic path length, β describes that rate of change between path length and the number of nodes, and
captures any curvature. This is algebraically equivalent to finding a curve of the form x = β* n1/
, where β* = β– 1/
, and we expect
> 1 when the artifact is present. When the data do not suffer from the artifact, there can still be a relationship between path lengths and nodes such that β* > 0, but
1. To test for the artifact, Webster et al. (2003; Supplementary Information) describe a generalized least-squares (GLS) procedure based upon Pagel's (1997), (1999)continuous method. The GLS method assesses the relationship between path lengths and nodes using all of the information in the phylogenetic tree and accounting for phylogenetic relatedness in both measures.
Here we report on the performance of the delta test for detecting the node-density artifact by analyzing simulated gene-sequence data on random phylogenetic trees. Our particular interest is to determine how well the
> 1 criterion identifies trees suffering from the artifact.
| Methods |
|---|
|
|
|---|
Simulation Data
We used PhyloGen (Rambaut, 2002) to simulate 1000 random ultrametric trees of 50 species each. The speciation rate was set to twice that of the extinction parameter (birth = 0.2, death = 0.1, respectively). We then added an artificial outgroup taxon to each tree. This was done to ensure that all the branches leading to the true root were estimated properly (as described below).
For each of the 1000 random topologies we used SeqGen (Rambaut and Grassly, 1997) to generate 50 random gene-sequence data sets of 1000 base pairs. We generated data from the general time-reversible (GTR +
4) model of sequence evolution, choosing the values of the rate parameters in the GTR matrix at random for each data set from the uniform interval between 0 and 20, with the exception of the G
T rate, which was always 1. All base frequencies were assumed to be 0.25. We chose the value of the gamma shape parameter on the uniform interval 0 to 4 and varied the tree length by randomly choosing the root-to-tip distance (substitutions per site) between 0.2 and 2.2 each time an alignment was simulated. This gave us 50,000 data sets in which, because the trees are ultrametric, there is no relationship between the number of nodes along a path and the path length.
We estimated the phylogenetic branch lengths for each of the 50,000 data sets using PAUP* 4.0b10 (Swofford, 2001) and giving it the correct topology. Although the simulated trees were rooted, all branch lengths were estimated on unrooted trees. The artificial outgroup taxon was included and used to estimate where the true root of the tree should be placed along the basal branch. If branch lengths are estimated on rooted trees, maximum likelihood will correctly estimate the total length of the branch leading from the outgroup to the ingroup taxa, but it does not know where to place the root along this branch. If the root is placed such that the arbitrary length of the segment leading from the root to the outgroup is short, this can falsely give the impression of the node-density artifact.
Although PAUP was given the true topology, we used a GTR model of evolution without gamma to infer the branch lengths in each of the 50,000 data sets. The simple GTR model will fail to capture the exact nature of the evolutionary process that gave rise to the data and is therefore expected to produce the node-density effect to varying degrees (see also Zharkikh, 1994).
We generated a further 10,000 data sets of 1000 base pairs using 10 replicates each of the same 1000 simulated trees, and the same range of parameters. For each of these data sets we estimated the branch lengths in PAUP but using a GTR +
4 model. Inferring the branch lengths with the same model as the data were simulated by means that the evolutionary process that gave rise to the data will be well approximated and we do not expect the node-density artifact to be present.
Node-Density Analyses
We removed the artifical outgroup taxon from each data set and rooted the trees at the point the outgroup taxon had identified. Then, for each tree we first tested for a relationship between the reconstructed path length, calculated as the sum of the branches from the root to the tip for each species, and the number of internal nodes along that path, starting at zero for the root and not counting the tip at the end of the path as an additional node (Webster et al., 2003, count species as additional nodes meaning that the values reported in their figure 1 would differ from ours by one. We prefer the present method of counting nodes as it corresponds to speciation events on the tree; see also Discussion.) The relationship between nodes and path lengths is tested by means of a likelihood-ratio (LR) statistic comparing the likelihood of a random-walk model to a directional random-walk model (Pagel, 1997, 1999; Webster et al., 2003; Supplementary Information). The models differ by the parameter β* as described above in the equation for x, where β* measures the regression of path length on nodes. We expect β* = 0 when no artifact is present. If the artifact is present in the data, we expect β* > 0 and that the directional model will provide a better fit. Twice the difference in likelihoods (the LR) is assessed by a
12 distribution.
Because the true trees are ultrametric, a significant association between path length and nodes is evidence, apart from chance effects, for the node-density artifact. In real data the nature of the true tree is not known, and a relationship between the number of nodes and path length could arise for reasons other than the artifact (see, for example, Webster et al., 2003). However, the artifact can be distinguished from other causes by the nature of the relationship it produces between path lengths and nodes. In particular, the delta test asserts that when a significant association has been caused by the artifact, we expect the parameter
to be greater than 1. For each significant directional model (β* significantly > 0), we therefore also separately estimated
and recorded its value (the test makes no predictions about
when the artifact is not present). In practice we find that β* and
are more accurately estimated from n = β x
than from the equivalent regression of path length on nodes (see Appendix), and all of our analyses used this form of the equation. We took any numerical value of
> 1 in conjunction with a significant directional model to be evidence of the node-density effect. The performance of the delta test is measured by the proportion of the simulated data sets with significant associations between nodes and path length in which the parameter
is greater than 1. Software to implement the test is available from www.evolution.reading.ac.uk or www.ams.rdg.ac.uk/zoology/pagel.
Distributional Statistics
Using the methods described above we derived for each data set a likelihood-ratio statistic comparing the directional to the random walk model—this is the test of β*. Under the null hypothesis of no artifact, we expect the cumulative density of LR values to conform to a
12 density. We compared distributions of the LR statistics to these expected
12 densities using the the Kolmogorov-Smirnov (K-S) D statistic.
Tree Shape
To examine whether the shape of the simulated trees influenced the probability of obtaining an artifact, we calculated three measures of tree shape for each tree using the computer program MeSA (Agapow and Purvis, 2002): Colless' (1982) index Ic, a measure of tree imbalance; Shao and Sokal's (1990) B1 index, a measure of tree balance; and Rohlf et al.'s (1990) noncumulative steminess index.
| Results |
|---|
|
|
|---|
The tree in Figure 1 shows the artifact. The LR test of the directional model returns a significant LR of 7.38, the slope β* is estimated to be 0.13, and
= 7.33 (all values estimated by maximum likelihood).
Manipulating the Presence/Absence of the Artifact
We expect to see the artifact at much higher than chance levels in the 50,000-tree data set (hereafter, artifact data), but not in the 10,000-tree data set (hereafter nonartifact data). Figure 2a plots the cumulative distribution of the 10,000 observed LR values for the non-artifact data along with the cumulative distribution of a true
12. The two lines fall on top of each other, and the K-S test confirms that the observed cumulative density does not depart from the expected
12 distribution (D = 0.09559, P = 0.3189): analyzing the simulated data with the model that generated it returns accurately estimated branch lengths. On the other hand, Figure 2b shows that the distribution of the LR statistics resulting from the artifact data set of 50,000 trees is considerably skewed to the right of the expected
12 distribution. This indicates more large LR scores than expected, and the distribution returns a significant K-S test (D = 0.4735, P < 0.0001). Inferring the branch lengths on ultrametric trees using the "wrong" model of sequence evolution gives rise to the node-density artifact.
Detecting the Node-Density Artifact
In the artifact data, 48.67% (n = 24,336) of the simulated data sets showed a significant and positive association between total path length and the number of nodes. The artifact, as measured by the size of the LR statistic, was more likely to arise in trees with greater rate heterogeneity, as indicated by the
-shape parameter of the gamma distribution (r = –0.6024 P < 0.0001), and somewhat more likely to arise in longer trees, (r = 0.272, P < 0.0001). These results are expected: in shorter trees and in trees with minimal rate heterogeneity, the inferred branch lengths capture all or nearly all of the true changes in the data, and the artifact is negligible or not present.
The delta test expects that data sets displaying the artifact will return values of
> 1. In 94.4% of the 24,336 data sets with significant LR statistics, the maximum likelihood estimate of
exceeded 1 (Table 1). Thus the delta test correctly identifies cases of the artifact at a high rate. By comparison, only the expected 5% (5.12%) of the 10,000 nonartifact data sets showed a significant association between nodes and path length. Fewer than half of these (1.95% of the total) showed the positive association expected of the artifact. Of this 1.95% about 87% return an estimate
greater than 1. This means that the delta test has a type I error rate of about 1.7% in these data.
|
Figure 3 shows the LR statistic plotted against the estimate of
for each of the 24,336 artifact data sets with significant positive associations between path length and nodes. As the estimate of
moves past 1, the LR statistic increases sharply. Because
measures the curvature of the relationship, this plot emphasizes that when the node-density artifact is present (LR > 3.84), the expected curvilinear relationship between path length and nodes arises, such as in Figure 1. The opposite point also holds: values of
< 1 are not expected when the artifact is present and Figure 3 confirms this with only 5.6% of the estimated
values less than 1.0.
|
The decline in LR values for larger values of
probably arises from trees with a small variance in total path lengths across the tips. Consider in Figure 1b if there were very little difference among species in total path lengths. In the limit if all species have the same path length, the plot will produce a horizontal line. As this limit is approached the directional model offers less and less improvement on the nondirectional model, eventually declining to zero. At the same time, as the limit is approached, the x = β* n1/
curve is required to turn an increasingly sharp corner, requiring higher values of
. In support of this conjecture, we find that for the 24,336 results plotted in Figure 3, the correlation between the variance in path lengths and LR is 0.48 (P < 0.0001).
Tree Shape
The shape of the tree, at least as revealed by the three measures we employed, did not influence the probability of finding a significant association between path lengths and nodes. The r2 values relating the likelihood-ratio to the Ic, B1, and steminess scores were 0.008, 0.001, and 0.015, respectively. This may reflect that randomly generated trees of size n = 50 tend to be relatively homogeneous. Colless' Ic statistic, for example, varies between 0 (perfectly balanced tree) and 1 (pectinate or ladder tree). In our sample, the mean Ic was 0.12 ± 0.03—most trees were relatively balanced.
In the limits, a perfectly balanced tree cannot suffer from the node-density artifact because all paths from the root to the tips traverse the same number of nodes. At the other extreme, a pectinate tree has the potential to show a large effect. However, the same simulated topology often gave qualitatively different results in our study, depending upon the parameters used to generate the data. Figure 4 shows a single simulated tree with an Ic score of 0.15. The tree has seven independent clades in which node density varies in a pectinate-like manner. It returned the highest LR statistic we observed for data simulated with an
-shape parameter of 0.05 and a root-to-tip tree length of 2.16. With an
-shape parameter of 3.36 and a length of 0.71, the same tree returned one of the lowest observed LRs. The node-density artifact is not confined to highly imbalanced or poorly sampled trees but can arise whenever the true amounts of change are underestimated.
|
| Discussion |
|---|
|
|
|---|
Using the
> 1 criterion in conjunction with a significant regression of path lengths on nodes, the delta test correctly identified the node-density artifact in 94.4% of the simulated data sets in which it was present. When the artifact was absent, the test had a type I error rate of about 1.7%. This makes it a useful statistic for identifying cases in which inferred branch lengths may suffer from the systematic bias to which Fitch and Bruschi (1987) and Fitch and Beintema (1990) first called attention. It can be used as a general phylogenetic diagnostic tool, and for other cases in which it is important first to rule out the artifact, such as reconstructing ancestral states or calculating molecular clocks. Out of historical interest, we applied the delta test to the Fitch and Bruschi and Fitch and Beintema trees. Both return significant relationships between nodes and path lengths, and both have
estimated to be greater than 1 (Fitch and Bruschi's tree LR = 25.99 and
= 1.54, Fitch and Beintema's tree LR = 8.33 and
= 1.66).
Webster et al. (2003) introduced the delta test in their study of speciation rates affecting rates of molecular evolution. These authors analysed whether higher speciation rates—as evidenced by a larger number of internal nodes along a path—were associated with greater amounts of overall genetic change. The delta test was used to identify trees in which an apparent relationship between rates of speciation and path lengths could have arisen as a result of the node-density effect. After removing trees with significant regressions and
> 1, these authors found evidence for higher rates of molecular evolution linked to speciation in 34.8% of the trees that remained (this figure is 28.2% when nodes are counted as in this paper, see Methods).
Commenting on the Webster et al. study, Witt and Brumfield (2004) suggested that
1 is compatible with the artifact and cited the Fitch and Bruschi (1987) tree as an example. Mathematically
1 is not compatible with the artifact (see Webster et al., 2004, in reply), and our simulations support this: when the node-density artifact is present, values of
1 arise only around 5% of the time, and then as a result of chance variation. Had Witt and Brumfield analyzed the Fitch and Bruschi tree, they would have discovered (see above) that it reveals the predicted
> 1, despite appearing to produce a linear relationship between path lengths and nodes. This emphasizes the importance of applying phylogenetically based statistics to this problem.
Some authors have suggested that maximum likelihood inference is robust to the node-density effect because it uses a substitutional model of evolution (Bromham, 2003; Bromham and Penny, 2003; Bromham et al., 2002). Maximum likelihood methods are expected to perform far better than parsimony methods in reconstructing change along branches by allowing multiple changes, whereas parsimony can only "see" at most one. But as our results here and others (e.g., Zharkikh, 1994) have shown, even likelihood methods will underestimate the true amount of change, especially when the wrong model of sequence evolution is used to analyze the data. Yang (1994, 1996) and Pagel and Meade (2004, 2005) note that tree lengths often increase when more realistic models of sequence evolution are applied. Better fitting models of sequence evolution should reduce the strength of any observed relationship between nodes and path lengths, and this could be easily assessed by comparing the
values for trees inferred from different models.
Molecular sequence data are likely to harbor complex signals of their evolutionary history. Detecting, characterizing, and interpreting these signals using statistical methods is a powerful way to reconstruct the past (Pagel, 1997, 1999). The results we report here show that it is possible to detect phylogenies that display an artifact of phylogeny reconstruction that can bias inferences about such historical evolutionary events.
Appendix 1. Estimating From Path Length and Nodes Data
|
|---|
|
|
|---|
Webster et al. (2003) fit a curve of the form n = β x
to detect the node-density artifact, where n is the number of nodes, x is the phylogenetic path length, β describes that rate of change between path length and the number of nodes, and
captures any curvature. This is algebraically equivalent to x = β* n1/
, where β* = β– 1/
, and we expect
> 1 when the artifact is present. When the data do not suffer from the artifact, there can still be a relationship between path lengths and nodes such that β* > 0, but
1.
In practice β* and
can in some cases be tricky to estimate owing to vagaries of path length and nodes data. It is essential that the parameters are estimated controlling for the phylogenetic relationships among taxa (see Webster et al., 2003-Supplementary Information). In addition, we find that the parameters are normally more accurately estimated from n = β x
than from the equivalent regression of path length on nodes. This is especially true when
as estimated from n = β x
is less than 1.0. But there are exceptions and it is easiest to see from examples of nodes versus reconstructed path lengths how some of the estimation problems arise. Fortunately all of them can be resolved from viewing a plot of the data.
True
> 1
Figures A1a and A1b plot the same phylogenetic data first as nodes versus path lengths and second as path lengths versus nodes. Estimating
from n = β x
(Fig. A1a) yields
= 3.25, and estimating it from x = β* n1/
(Fig. A1b) yields
= 3.27. The estimated regression lines are drawn through the data. In general, when
> 1, it makes little difference which equation is used to estimate it (but see below for
>> 1). Nevertheless, we prefer the n = β x
equation on the assumption that in real data, path lengths will tend to be better estimated than the number of nodes (representing speciation events), and it is well known that regression models underestimate parameters to the extent that there is error in the independent variable.
|
True
>> 1An exception to the rule of using n = β x
can arise for trees that produce a large
. This can occur in short trees or trees with little rate heterogeneity. Figure A2a plots nodes versus reconstructed path lengths for a tree of 50 tips. The relationship is curvilinear with
>> 1. Estimating
from n = β x
yields the starkly incorrect line shown, with
from x = β* n1/
as in Figure A2b.
|
True
< 1We do not expect a true
< 1 in data with the node-density artifact, but
< 1 can arise when the artifact is absent. When the true
is less than 1, it may be estimated poorly from x = β* n1/
even giving the impression that the artifact is present (i.e.,
> 1). Figure A3a plots data from a real phylogeny for which by inspection it can be seen that
< 1. Estimating
from n = β x
yields
from Figure A3b according to x = β* n1/
yields
|
The second fitting procedure returns a worse log-likelihood and fails in this case because of an unusual feature of nodes data. Node numbers vary in discrete jumps, and most trees will have a range of path lengths for the same number of nodes. These two features cause the discretely spaced vertical stacks of data in Figure A3b. As with the previous example, an upwards curving line drawn through such data can have long vertical deviations from the points, and this tendency becomes more prominent the steeper the line. When this occurs, it is often the case that the maximum likelihood estimator is a downwards curving line, such as the one obtained for these data, and for the same reasons as given above. Estimating
from n = β x
avoids this problem. It also uses path lengths on the x-axis and these are likely to be better estimated that numbers of nodes. | Acknowledgements |
|---|
This work was supported by BBSRC G19848 [GenBank] and a BBSRC Studentship to C.V. Tom Kirkman kindly modified his computer program to implement the Kolmogorov-Smirnov test and calculate the cumulative distribution frequency.
| References |
|---|
|
|
|---|
-
Agapow P. M., Purvis A. Power of eight tree shape statistics to detect nonrandom diversification: A comparison by simulation of two models of cladogenesis. Syst. Biol. (2002) 51:866–872.
Bromham L. Molecular clocks and explosive radiations. J. Mol. Evol. (2003) 57(Suppl 1):S13–S20.[CrossRef][Web of Science][Medline]
Bromham L., Penny D. The modern molecular clock. Nat. Rev. Genet. (2003) 4:216–224.[CrossRef][Web of Science][Medline]
Bromham L., Woolfit M., Lee M. S., Rambaut A. Testing the relationship between morphological and molecular rates of change along phylogenies. Evol. Int. J. Org. Evol. (2002) 56:1921–1930.
Fitch W. M., Beintema J. J. Correcting parsimonious trees for unseen nucleotide substitutions: The effect of dense branching as exemplified by ribonuclease. Mol. Biol. Evol. (1990) 7:438–443.[Abstract]
Fitch W. M., Bruschi M. The evolution of prokaryotic ferredoxins—with a general method correcting for unobserved substitutions in less branched lineages. Mol. Biol. Evol. (1987) 4:381–394.[Abstract]
Pagel M. Inferring evolutionary processes from phylogenies. Zool. Scripta (1997) 26:331–348.[CrossRef][Web of Science]
Pagel M. Inferring the historical patterns of biological evolution. Nature (1999) 401:877–884.[CrossRef]
Pagel M., Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. (2004) 53:571–581.
Pagel M., Meade A. Mixture models in phylogenetic inference. In: Mathmatics of evolution and phylogeny—Gascuel O., ed. (2005) New York: Oxford Univiversty Press. 121–139.
Rambaut A. PhyloGen: Phylogenetic tree simulator package (2002) Department of Zoology, University of Oxford. version 1.1.
Rambaut A., Grassly N. C. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. (1997) 13:235–238.
Rohlf F. J., Chang W. S., Sokal R. R., Kim J. Y. Accuracy of estimated phylogenies: Effects of tree topology and evolutionary model. Evolution (1990) 44:1671–1684.[CrossRef][Web of Science]
Swofford D. L. Phylogenetic analysis using parsimony (2001) Sunderland, Massachusetts: Sinauer Associates. PAUP*: (*and other methods), version 4.0b10.
Webster A. J., Payne R. J., Pagel M. Molecular phylogenies link rates of evolution and speciation. Science (2003) 301:478.
Webster A. J., Payne R. J., Pagel M. Response to comments on "Molecular phylogenies link rates of evolution and speciation". Science (2004) 303:173d–174d.
Witt C. C., Brumfield R. T. Comment on "Molecular phylogenies link rates of evolution and speciation" (I). Science (2004) 303:173. author reply 173.
Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. (1994) 39:306–314.[CrossRef][Web of Science][Medline]
Yang Z. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. (1996) 11:367–372.[CrossRef]
Zharkikh A. Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. (1994) 39:315–329.[CrossRef][Web of Science][Medline]
This article has been cited by other articles:
![]() |
D. C. Marshall Cryptic Failure of Partitioned Bayesian Phylogenetic Analyses: Lost in the Land of Long Trees Syst Biol, November 17, 2009; (2009) syp080v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Venditti, A. Meade, and M. Pagel Phylogenetic Mixture Models Can Reduce Node-Density Artifacts Syst Biol, April 1, 2008; 57(2): 286 - 293. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Ekman, H. L. Andersen, and M. Wedin The Limitations of Ancestral State Reconstruction and the Evolution of the Ascus in the Lecanorales (Lichenized Ascomycota) Syst Biol, February 1, 2008; 57(1): 141 - 156. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







