Systematic Biology 2004 53(4):623-637; doi:10.1080/10635150490503035

© 2004 Society of Systematic Biologists

Tracing the Decay of the Historical Signal in Biological Sequence Data

Edited by Peter Lockhart: Associate Editor

Simon Y.W. Ho1,3 and Lars S. Jermiin1,2

1 School of Biological Sciences, University of Sydney NSW 2006 Australia
2 Sydney University Biological Informatics and Technology Centre, University of Sydney NSW 2006 Australia


    Abstract
 Top
 Notes
 Abstract
 Materials and Methods
 Results
 Case Study: β-Tubulin in...
 Discussion
 Acknowledgment
 References
 
Alignments of nucleotide or amino acid sequences may contain a variety of different signals, one of which is the historical signal that we often try to recover by phylogenetic analysis. Other signals, such as those arising due to compositional heterogeneities, among-lineage and among-site rate heterogeneities, invariant sites, and covariotides, may interfere adversely with the recovery of the historical signal. The effect of the interaction of these signals on phylogenetic inference is not well understood and may, in many cases, even be underappreciated. In this study, we investigate this matter and present results based on Monte Carlo simulations. We explored the success of four phylogenetic methods in recovering the true tree from data that had evolved under conditions where the equilibrium base frequencies and substitution rates were allowed to vary among lineages. Seven scenarios with increasingly complex conditions were investigated. All of the methods tested, with the exception of neighbor-joining using LogDet distances, were sensitive to compositional convergence in nonsister lineages. Maximum parsimony was also susceptible to attraction between long edges. In many cases, however, phylogenetic inference methods can still recover the true tree when misleading signals are present, in some instances even when the historical signal is no longer dominant. These results highlight the growing need for simple methods to detect violation of the phylogenetic assumptions.

Keywords: Compositional heterogeneity; edge lengths; Monte Carlo simulation; networks; phylogenetic signal; rate heterogeneity; substitutional saturation

Received August 25, 2003; Revised December 14, 2003; Accepted March 27, 2004


Inferring phylogenetic trees is an important task in studies of molecular evolution and systematics, and many methods are now available for this purpose; each method has its own assumptions, merits, and shortcomings. For all of the methods, however, errors can arise in the inferred edge lengths and sometimes also in the inferred topology. Errors in estimates of edge lengths and topology have sometimes been viewed as separate problems, with the latter assigned greater importance to systematic and taxonomic studies, and the former to divergence date estimation. This dichotomous treatment of phylogenetic errors is not justified because topological inference relies on estimates of edge lengths, particularly of the internal edges; hence, the two issues are the obverse and reverse of the same ‘phylogenetic coin.’

Phylogenetic inference using sequence data entails the recovery of a historical signal, from which we can estimate the phylogeny. The historical signal is, however, only one of several signals; other signals, such as those arising from compositional heterogeneity (Lockhart et al., 1992), rate heterogeneity among lineages (Felsenstein, 1978) and among sites (Yang, 1993), invariant sites (Fitch, 1986b), and covariotides or covarions (Fitch, 1986a), may interfere adversely with the extraction of the historical signal. In fact, any signal that has experienced convergence in nonsister lineages will affect recovery of the historical signal. When the historical signal has been eroded to the extent that it is no stronger than the other confounding signals, then the methods for inferring evolutionary distances and topologies are likely to yield biased estimates (Swofford et al., 1996).

The property of statistical consistency (Felsenstein, 1978), which refers to the ability of a given method to converge on the correct tree with an increasing amount of data (i.e., alignment length), has become a distinguishing criterion between ‘good’ and ‘bad’ methods, along with other features such as efficiency, power, and robustness (Penny et al., 1992). In practice, however, phylogenetic data comprise sequences of finite length, and notions of statistical consistency, although theoretically desirable, may not always represent the most practical criterion for assessing phylogenetic methods (Hillis et al., 1994b; Kim, 1996; Farris, 1999).

Inconsistency can arise when certain combinations of evolutionary parameters are not adequately accounted for in the process of phylogenetic inference (e.g., Felsenstein, 1978; Hendy and Penny, 1989; Lockhart et al., 1994). Accordingly, under ‘simple’ conditions (i.e., the implicitly or explicitly assumed conditions), phylogenetic methods are typically consistent, and vice versa (Hillis et al., 1994a). However, reducing the problem into this dichotomy can tempt us into overlooking the true complexity of the issue. Phylogenetic methods have customarily been judged by their success in recovering the topology of the true tree, often with disregard for the accuracy of the inferred edge lengths. In fact, it is in the latter that we can trace the symptoms of decay in the strength of the historical signal. For instance, in the ‘gray’ area, where parameter values are intermediate between the ‘simple’ and ‘complex’ conditions, it remains possible to recover the topology of the true tree, but the confounding effects of nonhistorical signals may mislead estimates of edge lengths and consequently reduce the confidence that can be placed on the topology that is inferred.

A well-known case of phylogenetic methods being misled by a confounding signal is that of ‘long-edge attraction’ (Felsenstein, 1978; Hendy and Penny, 1989; Jermiin et al., 2003). Long edges on a phylogenetic tree can arise as a result of rate heterogeneity among lineages (Felsenstein, 1978), or in the analysis of noncontemporaneous sequences (Felsenstein, 1978; Li et al., 1988). Maximum parsimony is clearly misled by model misspecification in these cases, but Hendy and Penny (1989) showed that long-edge attraction might occur even when the evolutionary model fits the data well. There is now also evidence that maximum-likelihood and distance methods can be affected by rate heterogeneity among lineages when inappropriate substitution models are used (Hillis et al., 1994b; Gaut and Lewis, 1995; Chang, 1996; Lockhart et al., 1996). Simulation studies using various methods of phylogenetic inference have shown that the problem can occur not only in quartets of sequences (e.g., Hillis and Huelsenbeck, 1993; Tateno et al., 1994; Gaut and Lewis, 1995; Jermiin et al., 2003), but also in data sets involving many more sequences (Hendy and Penny, 1989; DeBry, 1992; Hendy and Charleston, 1993; Zharkikh and Li, 1993; Kim, 1996; Conant and Lewis, 2001; Pol and Siddall, 2001; Holland et al., 2003). Convergence due to similar nucleotide composition in nonsister lineages can also mislead inference methods, in a manner similar to that of long-edge attraction. Cogent examples of this problem in empirical data sets have been presented previously (Hasegawa and Hashimoto, 1993; Lockhart et al., 1994; Chang and Campbell, 2000; Tarrío et al., 2001), and the problem has also been demonstrated by Monte Carlo simulation (Galtier and Gouy, 1995; Jermiin et al., 2004). These concerns, however, have not weighed heavily in past phylogenetic studies, as pointed out by Mooers and Holmes (2000), and may in fact be far more serious than previously thought (Jermiin et al., 2004).

Methods hitherto developed to accommodate compositional variation have not been widely accepted or applied, and even the more recognized methods, such as the LogDet or paralinear distance correction (Lake, 1994; Lockhart et al., 1994; Steel, 1994), are limited by their assumptions and certain aspects of their design. RY-recoding of sequence data has also been investigated as an option (Phillips et al., 2001), but this has not yet been studied in detail. The maximum-likelihood methods (Yang and Roberts, 1995; Galtier and Gouy, 1998), on the other hand, require the estimation of many parameters, and run the risk of overparameterization, a problem to which short edges may be particularly susceptible.

Previous studies have addressed the abovementioned confounding factors as separate problems. In real sequence data, however, there can be complex interactions between them, with an outcome that is difficult to predict. In this paper, we present the results of simulation studies that explore the interplay between compositional heterogeneity and rate variation among lineages. In order to gain an insight into the complex problems this can pose for phylogenetic analysis, we evaluate the performance of commonly used tree inference methods in relation to the strength of the historical signal in the sequence data.


    Materials and Methods
Data sets were generated using the computer program Hetero (Jermiin et al., 2003), which simulates the evolution of a nucleotide sequence on a rooted binary tree with four terminal nodes (see Fig. 1). Each simulation was based on a tree with prespecified edge lengths: t = ta = tb = tc = td and te = tf = 0.01. Unlike other available simulation programs, Hetero allows the user to specify a model of substitution, Ri, for each edge i, along with other parameters. We used a restricted substitution model, with all 12 conditional rates of change set at the same value, to generate the sequences:


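$$
R_i = \begin{pmatrix}
-\sum_{y \neq A} \alpha^i_{Ay}\pi^i_y & \alpha^i_{AC}\pi^i_C & \alpha^i_{AG}\pi^i_G & \alpha^i_{AT}\pi^i_T \\
\alpha^i_{CA}\pi^i_A & -\sum_{y \neq C} \alpha^i_{Cy}\pi^i_y & \alpha^i_{CG}\pi^i_G & \alpha^i_{CT}\pi^i_T \\
\alpha^i_{GA}\pi^i_A & \alpha^i_{GC}\pi^i_C & -\sum_{y \neq G} \alpha^i_{Gy}\pi^i_y & \alpha^i_{GT}\pi^i_T \\
\alpha^i_{TA}\pi^i_A & \alpha^i_{TC}\pi^i_C & \alpha^i_{TG}\pi^i_G & -\sum_{y \neq T} \alpha^i_{Ty}\pi^i_y
\end{pmatrix}
$$

where $\alpha^i_{xy}$ is the conditional rate of change from nucleotide x to nucleotide y, measured in average substitutions per site per unit time, and $\pi^i_y$ is the frequency of nucleotide y. Every simulation began with a randomly generated nucleotide sequence with uniform nucleotide content.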
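As an illustration of the restricted model, the following minimal Python sketch builds a rate matrix for one edge. It rests on two assumptions that are not spelled out in the text above: each off-diagonal entry is the conditional rate multiplied by the frequency of the target nucleotide, and each diagonal entry is chosen so that the row sums to zero. The function names `rate_matrix` and `average_rate` are hypothetical and are not part of Hetero.

```python
NUCLEOTIDES = "ACGT"

def rate_matrix(alpha, pi):
    """Build a 4x4 rate matrix as nested dicts (assumed form, see lead-in).

    alpha: one conditional rate shared by all 12 off-diagonal entries
    pi:    dict mapping each nucleotide to its frequency
    """
    matrix = {}
    for x in NUCLEOTIDES:
        # Off-diagonal entries: conditional rate times target frequency.
        row = {y: alpha * pi[y] for y in NUCLEOTIDES if y != x}
        # Diagonal entry: negative sum of the rest, so the row sums to zero.
        row[x] = -sum(row.values())
        matrix[x] = row
    return matrix

def average_rate(matrix, composition):
    """Expected substitutions per site per unit time for a given composition."""
    return sum(composition[x] * -matrix[x][x] for x in NUCLEOTIDES)

uniform = {n: 0.25 for n in NUCLEOTIDES}
R_null = rate_matrix(0.4, uniform)
null_rate = average_rate(R_null, uniform)
print(round(null_rate, 3))  # 0.3
```

Under these assumptions, a conditional rate of 0.4 with uniform nucleotide content gives an average rate of 0.3 substitutions per site per unit time. This agrees with the total tree length of 3.606 quoted for t = 3 in the caption of Fig. 3, since 0.3 × (4 × 3 + 2 × 0.01) = 3.606.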


Figure 1 The rooted tree used by Hetero (Jermiin et al., 2003) to simulate sequence evolution in order to produce alignments of sequences A, B, C, and D. Base frequencies are specified for the root node (O), and were set to 0.25 for each of the four nucleotides. Each edge, i, is assigned its own rate matrix, Ri. Edge lengths (ti) are specified in units of time, unlike in most previous programs that have used average substitutions per site; it is necessary to observe this distinction when the nucleotide content is not stationary across the whole tree (for an explanation of this point, see Jermiin et al., 2003).

 
The output of Hetero includes the alignments, user-specified parameters, and other characteristics of the simulation, including the pairwise differences in GC content among the four sequences. Hetero also produces a table with the proportion of sites that have changed X times (X = 0, 1, 2,...), and a numerical summary of the number of sites supporting each of the parsimoniously informative splits for an unrooted tree with four leaves (i.e., AB|CD, AC|BD, and AD|BC).
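The split summary can be pictured with a short Python sketch. This is not Hetero's code; it is a minimal illustration, assuming that a site is parsimoniously informative for a four-sequence alignment when two sequences share one nucleotide and the other two share a different one. The function name `split_support` is hypothetical.

```python
def split_support(a, b, c, d):
    """Count sites supporting each informative split of sequences A, B, C, D."""
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for w, x, y, z in zip(a, b, c, d):
        if w == x and y == z and w != y:
            counts["AB|CD"] += 1
        elif w == y and x == z and w != x:
            counts["AC|BD"] += 1
        elif w == z and x == y and w != x:
            counts["AD|BC"] += 1
        # Constant sites, singletons, and sites with three or four
        # different nucleotides support none of the three splits.
    return counts

toy = split_support("AACGTA", "AAGTAA", "CCCTTA", "CCGGCA")
print(toy)  # {'AB|CD': 2, 'AC|BD': 1, 'AD|BC': 1}
```

In the toy alignment, the first two columns support AB|CD, the third supports AC|BD, the fourth supports AD|BC, and the last two columns are uninformative.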

Seven cases of interest were explored by Monte Carlo simulation, including a null case where all of the fundamental phylogenetic premises were met. The remaining six cases involved converging nucleotide contents or rate heterogeneity among lineages, or combinations of the two, along different edges. Further details about each case are provided in the next section. A sequence length of 10,000 bp was used in all simulations, in order to test the performance of methods when there are abundant data. Increasing the sequence length reduces the effect of fluctuations associated with realizations of a stochastic simulation process. In reality, the sequences being analyzed are usually shorter than this, particularly given the possible presence of invariant sites, but our aim is to provide a clear picture of signal decay; using shorter sequences might simply make it more difficult to draw reliable conclusions.

In each case, simulated data sets were analyzed using maximum-parsimony (Fitch, 1971), maximum-likelihood with the F81 model of nucleotide substitution (Felsenstein, 1981), and neighbor-joining (Saitou and Nei, 1987) methods, as implemented in PHYLIP, the Phylogeny Inference Package (Felsenstein, 2002). The program weighbor (Bruno et al., 2000) was also used for weighted neighbor-joining analyses. Distance matrices for each of the neighbor-joining methods were estimated using the nucleotide substitution model of Jukes and Cantor (1969) and the LogDet metric (Lockhart et al., 1994; Steel, 1994). For data sets where one or more observed pairwise distances exceeded the maximum allowed by the models (and were thus undefined), which sometimes occurred at large values of t, offending alignments were removed prior to inference by the distance methods.
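To see why a pairwise distance can be undefined, consider the Jukes-Cantor correction, d = −(3/4) ln(1 − 4p/3), where p is the observed proportion of differing sites. The logarithm has no real value once p reaches 0.75. The sketch below is a minimal Python illustration of that behaviour, not the PHYLIP implementation; the function name `jc_distance` and the choice to return `None` for undefined distances are assumptions made for this example.

```python
import math

def jc_distance(seq1, seq2):
    """Jukes-Cantor distance between two aligned sequences.

    Returns None when the observed proportion of differences is 0.75 or
    more, because the correction is undefined in that range.
    """
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    differences = sum(x != y for x, y in zip(seq1, seq2))
    p = differences / len(seq1)
    if p >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

close_pair = jc_distance("ACGTACGT", "ACGTACGA")    # p = 0.125
saturated_pair = jc_distance("ACGT", "CGTA")        # p = 1.0, undefined
```

An alignment containing a pair like `saturated_pair` would be one of the "offending alignments" excluded from the distance analyses.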

For three of the simulation cases, we also used the output of Hetero to draw three-dimensional networks (i.e., connected graphs with one or more cycles) to represent the available information (including both historical and conflicting signals) found in the sequence alignments. Networks are used in place of trees for situations where trees cannot adequately illustrate the evolutionary process, such as in the presence of species hybridization or horizontal genetic transfers, but can also be useful for representing phylogenetic ambiguities (Bandelt and Dress, 1992). In each network, every edge was drawn with a length proportional to the amount of support for it in the sequence alignment: internal edge lengths reflect the proportion of sites supporting phylogenetically informative splits, whereas terminal edge lengths are based on the proportion of corresponding singleton sites.
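The edge lengths of such a network can be sketched in Python as follows. The sketch assumes that a singleton site is one at which exactly one sequence differs from the other three, which are identical; this definition and the function names `classify_site` and `network_edge_lengths` are assumptions for illustration and are not taken from Hetero.

```python
def classify_site(w, x, y, z):
    """Return the network edge supported by one alignment column, or None."""
    if x == y == z and w != x:
        return "A"          # singleton on the terminal edge to A
    if w == y == z and x != w:
        return "B"
    if w == x == z and y != w:
        return "C"
    if w == x == y and z != w:
        return "D"
    if w == x and y == z and w != y:
        return "AB|CD"      # informative split
    if w == y and x == z and w != x:
        return "AC|BD"
    if w == z and x == y and w != x:
        return "AD|BC"
    return None             # constant, or three or four different states

def network_edge_lengths(a, b, c, d):
    """Proportion of sites supporting each terminal and internal edge."""
    labels = ["A", "B", "C", "D", "AB|CD", "AC|BD", "AD|BC"]
    counts = {label: 0 for label in labels}
    for column in zip(a, b, c, d):
        label = classify_site(*column)
        if label is not None:
            counts[label] += 1
    return {label: counts[label] / len(a) for label in labels}

lengths = network_edge_lengths("ACGTACAA", "AGTAACCA", "CCTAAGGC", "CGGAACTC")
```

For this eight-column toy alignment, the internal edge for AB|CD has length 0.25, the other two internal edges have length 0.125 each, and the terminal edges to A and C have length 0.125 each.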


    Results
Case I—The Null Case
In this case, the nucleotide sequences were allowed to evolve at the same constant rate and with base compositions similar to that of the ancestral sequence (O; Fig. 1). Hence, the sequences evolved under homogeneous and stationary conditions (i.e., those that most of the phylogenetic methods assume).

The four phylogenetic methods perform comparably when their assumptions are met (Fig. 2). As t increases, the historical signal becomes increasingly faint, and the probability of inferring the correct tree topology converges on 33%, corresponding to the point where the historical signal has been lost completely. At that point, the sequences are said to be phylogenetically uninformative (and most aptly represented by a star phylogeny with very long edge lengths), and a tree is effectively being selected at random.


Figure 2 The ability of four phylogenetic methods to infer the correct tree from nucleotide sequences generated by Monte Carlo simulation using the tree in Fig. 1. For this case, we assigned identical Ri matrices (with $\alpha^i_{xy} = 0.4$ and $\pi^i_y = 0.25$) to the six edges to produce data sets with 1000 alignments of 10,000 nucleotides. The data sets were then analyzed using the phylogenetic methods described in Materials and Methods. The expected tree is illustrated above the five panels, followed by four panels showing how frequently the correct tree topology is inferred by the following methods, respectively: maximum parsimony (MP); maximum likelihood (ML); neighbor-joining using Jukes-Cantor (NJ-JC) and LogDet (NJ-LD) distances; and weighted neighbor-joining (with down-weighting of large distances in the matrix) using Jukes-Cantor (WJ-JC) and LogDet (WJ-LD) distances. Note that the points and curves for NJ-JC and NJ-LD coincide; the same applies to WJ-JC and WJ-LD. In the analyses using maximum likelihood, we used the F81 model of nucleotide substitution, corresponding to the model used in generating the data. The dashed horizontal line represents the probability of randomly choosing the correct tree (33%). The fifth panel shows the relative proportions of sites supporting each of the three parsimoniously informative splits.

 
In order to appreciate what underpins the decay of the historical signal, it is necessary to assess the relative proportions of parsimoniously informative splits. For very low values of t (<0.5), the relative proportion of sites supporting the correct split (i.e., AB|CD) is much larger than those supporting the alternative splits (i.e., AC|BD and AD|BC); but as t becomes larger, support for the alternative splits approaches the level of support for the correct split (Fig. 2). The primary reason for this change is that the fraction of sites that have changed repeatedly (multiple hits) increases with the size of t (Fig. 3); percentages of sites that have experienced more than one change are 12.3% for t = 0.5; 33.7% for t = 1.0; 69.0% for t = 2.0; and 87.3% for t = 3.0. Thus, it is not surprising that as t increases, levels of support for the two incorrect splits become increasingly similar to the support for the true split.
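The percentages above can be approximated with a simple Poisson calculation. The Python sketch below assumes that changes at a site are counted over the whole tree, that the number of changes per site is Poisson distributed, and that the average rate is 0.3 substitutions per site per unit time (0.4 × 0.75, given conditional rates of 0.4 and uniform frequencies of 0.25). These are assumptions made for illustration; the simulation itself does not rely on them. The function name `fraction_multiple_hits` is hypothetical.

```python
import math

def fraction_multiple_hits(t, rate=0.3, internal_edge=0.01):
    """Poisson probability that a site changes more than once on the tree.

    The tree has four terminal edges of length t and two internal edges
    of length internal_edge, all measured in units of time.
    """
    expected_changes = rate * (4 * t + 2 * internal_edge)
    # P(X >= 2) = 1 - P(X = 0) - P(X = 1) for a Poisson variable X.
    return 1.0 - math.exp(-expected_changes) * (1.0 + expected_changes)

approximations = {t: fraction_multiple_hits(t) for t in (0.5, 1.0, 2.0, 3.0)}
for t, value in approximations.items():
    print(f"t = {t}: {100 * value:.1f}%")
```

The approximations (about 12.4%, 34.0%, 69.3%, and 87.5%) lie within a fraction of a percentage point of the simulated values, which suggests that multiple hits across the whole tree account for the reported figures.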


Figure 3 Multiple-substitution profiles for sequences generated under null conditions, with terminal edges equivalent in length. Each graph shows the frequency (%) of sites that have changed X times, where X = 0, 1, ..., 10. Even after a short time (t = 0.5), some sites experienced up to four changes. When terminal edges had evolved for t = 3 time units, producing a tree with a total length of 3.606 average substitutions per site, some sites had experienced up to 12 substitutions (not on chart scale).

 
The effect of a decaying historical signal on estimates of the length of internal edges is shown in Fig. 4. For each of the phylogenetic methods and each of the splits, the estimated length of the internal edges increases with the size of t from the expected values to values that are several times larger. Interestingly, the bias appears to be similar for the three splits but notably different among the phylogenetic methods. In particular, maximum likelihood substantially overestimates the lengths of all three possible internal edges.


Figure 4 Estimates of internal edge lengths as a function of the terminal edge length. For each simulation, we estimated the average internal edge length for the correct split (i.e., AB|CD) and for the incorrect splits (i.e., AC|BD and AD|BC); these estimates were obtained by maximum parsimony (MP), maximum likelihood (ML) using the F81 model, and neighbor-joining with JC-corrected distances. We set a uniform nucleotide content and a transition-transversion ratio of 0.5 for estimation by maximum likelihood and neighbor-joining. The dashed horizontal lines correspond to the expected length of the internal edge in the correct split (i.e., AB|CD); the expected lengths of the internal edges in the incorrect splits (i.e., AC|BD and AD|BC) equal 0.0.

 
In the absence of other factors confounding phylogenetic inference, multiple nucleotide substitutions at the same site are the principal factor contributing to the erosion of the historical signal. Even with an average rate of 0.4 substitutions per site per unit time, the historical signal is quickly eroded; after 0.5 time units (equivalent to 0.2 average substitutions per site), 12.3% of the sites had undergone multiple substitutions. For real coding sequences, however, the proportion is very likely to differ from this value, because the evolutionary rate may vary among sites due to codon structure and other selective constraints.

Bearing in mind the results presented above, we are now ready to examine scenarios in which the phylogenetic assumptions are contravened by the data. Again, we will use Hetero to generate the sequences, but henceforth we will change the parameters such that one or more of the phylogenetic assumptions are violated. By choosing the parameters carefully, we will be able to directly compare the results from the following six cases with those obtained from the null case.

Case II—Rate Heterogeneity among Sister Lineages
In this case, the nucleotide sequences were allowed to evolve with base compositions similar to that of the ancestral sequence (O; Fig. 1), but the average rates of change along edges a and d were set to be 1.6 times those along edges e and f, and the average rates of change along edges b and c were set to be 0.4 times those along edges e and f. Thus, there was a fourfold difference among the substitution rates in the terminal edges. The sequences can be said to have evolved under heterogeneous but stationary conditions (i.e., those that many of the phylogenetic methods are not designed to account for).

The ability of the maximum-parsimony method to recover the true tree declined rapidly as the value of t increased (Fig. 6); this is the classical example of a long-edge attraction effect. The performance of maximum-likelihood (using the F81 substitution model) and distance methods (using either JC- or LogDet-corrected distances) was better than that of the maximum-parsimony method, but worse than in the null case (Fig. 2). The unusual pattern seen in the accuracy of the maximum-likelihood method is perhaps due to the ability of the method to compensate for the vastly increased edge lengths of a and d arising from the elevated rates in these edges, but may also be related to the inferred internal edge lengths (Fig. 4).


Figure 5 The ability of four phylogenetic methods to infer the correct tree from nucleotide sequences generated by Monte Carlo simulation using the tree in Fig. 1. In this study, we used different Ri matrices (for Ra and Rd, $\alpha^a_{xy} = \alpha^d_{xy} = 0.4$, $\pi^a_A = \pi^a_T = \pi^d_A = \pi^d_T = 0.4$ and $\pi^a_C = \pi^a_G = \pi^d_C = \pi^d_G = 0.1$; for Rb and Rc, $\alpha^b_{xy} = \alpha^c_{xy} = 0.4$, $\pi^b_A = \pi^b_T = \pi^c_A = \pi^c_T = 0.1$ and $\pi^b_C = \pi^b_G = \pi^c_C = \pi^c_G = 0.4$; and for Re and Rf, $\alpha^e_{xy} = \alpha^f_{xy} = 0.4$ and $\pi^e_y = \pi^f_y = 0.25$) to produce data sets with 1000 alignments of 10,000 nucleotides. Figure details and methods of analyses follow Fig. 2.

 
It is remarkable that the maximum-likelihood and distance methods (using either JC- or LogDet-corrected distances) were able to recover the tree with such accuracy, because the proportion of sites supporting an incorrect split (i.e., AD|BC) exceeded that supporting the correct split (i.e., AB|CD) for t ≥ 0.4 (Fig. 6). Apparently, correction for multiple substitutions at the same site has the capacity to alleviate the effect of rate heterogeneity among the lineages. Although this case has been studied extensively (e.g., Hillis et al., 1994a; Gaut and Lewis, 1995; Swofford et al., 2001; Jermiin et al., 2003), it appears that this point has not been stated explicitly.

Case III—Compositional Heterogeneity among Sister Lineages
In this case, the sequences were all set to evolve at the same conditional rates of change, but with differences in the nucleotide content arising as the sequences evolved along the different edges. The nucleotide content along edges e and f was set to match that of the ancestral sequence (O; Fig. 1), whereas the GC content was set to decrease along edges a and d, and to increase along edges b and c. Hence, the sequences evolved under heterogeneous and non-stationary conditions (i.e., those that many of the phylogenetic methods are not designed to account for).

As noted previously (Lockhart et al., 1994; Galtier and Gouy, 1995; Jermiin et al., 2004), phylogenetic methods can fail when the nucleotide composition converges in nonsister lineages. This was clearly the case for all of the methods assessed, except when the LogDet metric was used to estimate the distances between pairs of sequences (Fig. 7). As expected, the LogDet metric has the capacity to assuage the effect of compositional heterogeneity among the lineages, even for t ≥ 0.4, when the proportion of sites supporting an incorrect split (i.e., AD|BC) exceeds that supporting the correct split (i.e., AB|CD). Indeed, under those conditions, the method performed as well as it did in the null case. Interestingly, the accuracy of the maximum-likelihood method using the F81 substitution model increased from 0% to 10% as the value of t approached 3.5 (Fig. 7); we are unable to explain this increase but note that t = 3.5 corresponds to a situation where every site in the alignment has changed on the average 3.7 times and about 91.6% of the sites have changed more than once!


Figure 6 The ability of four phylogenetic methods to infer the correct tree from nucleotide sequences generated by Monte Carlo simulation using the tree in Fig. 1. For this case, we used different Ri matrices (for Ra and Rd, $\alpha^a_{xy} = \alpha^d_{xy} = 0.64$ and $\pi^a_y = \pi^d_y = 0.25$; for Rb and Rc, $\alpha^b_{xy} = \alpha^c_{xy} = 0.16$ and $\pi^b_y = \pi^c_y = 0.25$; and for Re and Rf, $\alpha^e_{xy} = \alpha^f_{xy} = 0.4$ and $\pi^e_y = \pi^f_y = 0.25$) to produce data sets with 1000 alignments of 10,000 nucleotides. These values were chosen because they ensured that the distribution of sites that have changed X times remained similar to those from the null case; in other words, we can compare the results from case II with those from the null case, and any difference between these two sets of results can be ascribed exclusively to the fourfold difference in the average rates of change along edges a and d versus b and c. Figure details and methods of analyses follow Fig. 2.
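One way to read the choice of parameter values is that the expected total number of substitutions on the tree is the same in case II as in the null case. The Python sketch below checks this arithmetic, assuming that the average rate on an edge equals the conditional rate multiplied by 0.75 when all four nucleotide frequencies are 0.25. The function names are hypothetical.

```python
ALPHA_NULL = 0.4          # conditional rate in the null case
UNIFORM_FREQUENCY = 0.25  # each nucleotide frequency under uniform content
INTERNAL_EDGE = 0.01      # te = tf, in units of time

def average_rate(alpha):
    """Average substitutions per site per unit time under uniform content."""
    return alpha * (1.0 - UNIFORM_FREQUENCY)

def expected_tree_length(t, terminal_alphas, internal_alpha=ALPHA_NULL):
    """Expected substitutions per site over four terminal and two internal edges."""
    terminal = sum(average_rate(alpha) * t for alpha in terminal_alphas)
    internal = 2 * average_rate(internal_alpha) * INTERNAL_EDGE
    return terminal + internal

fast = 1.6 * ALPHA_NULL   # edges a and d: 0.64
slow = 0.4 * ALPHA_NULL   # edges b and c: 0.16

null_length = expected_tree_length(3.0, [ALPHA_NULL] * 4)
case2_length = expected_tree_length(3.0, [fast, slow, slow, fast])
rate_ratio = average_rate(fast) / average_rate(slow)
```

Because the fast and slow conditional rates average to 0.4, both trees have the same expected length (3.606 substitutions per site at t = 3), while the rates on the terminal edges differ by a factor of four.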

 
The joint effect of a decaying historical signal and compositional heterogeneity on the inference of topology and edge lengths is illustrated using networks (Fig. 5, case III). As t increases, the lengths of the internal and terminal edges converge on different values, with the terminal edges being approximately identical to each other in length, but with the internal edge lengths differing considerably. Simultaneously, it becomes increasingly clear that the raw data support an incorrect split (i.e., AD|BC). In view of this, it is noteworthy that the LogDet metric can still produce distances that will lead to inference of the correct binary tree.


Figure 7 Network representations of the split information contained in the sequence data, for times ranging from t = 0.1 to t = 3.5. Each network is based on the number of singleton sites and parsimoniously informative sites, which is given as output by the program Hetero; these numbers were then scaled according to the length of the alignment. Case I: Initially, the correct tree is inferred, but as t increases and multiple substitutions at the same sites accumulate, some sites begin to support nonhistorical splits (i.e., AC|BD and AD|BC). When substitutional saturation occurs to the extent that the historical signal is completely lost, the lengths of different internal edges are approximately equal, and the central prism representing the internal edges metamorphoses into a cube. Case III: As t increases, the historical signal is eroded by multiple substitutions at the same site. Concomitantly, the compositional signal is increased, and by the time t ≥ 0.4 the compositional signal is so strong that it dominates over the historical signal. At that stage the major split separates A and D from B and C, and this remains the case as the point of mutational saturation is approached (i.e., as t approaches 3.5). Case VI: As t increases, the internal edge supporting an incorrect split (i.e., AD|BC) grows in length, due to the rate signal produced by the elevated substitution rates in edges a and d compared with the lower rates in edges b and c. However, when t increases beyond 1.5, the length of the internal edge supporting the correct split (i.e., AB|CD) grows due to a compositional signal, which is consistent with the historical signal.

 
Our results disagree with a number of previous studies, which claimed that only ‘extreme’ levels of compositional disparity can affect the accuracy of phylogenetic methods (Conant and Lewis, 2001; Rosenberg and Kumar, 2003). The correct tree could not be recovered every time with conventional phylogenetic methods when differences in GC content (A and D versus B and C) exceeded 6.7%; when the differences exceeded 10.7% and 19.5%, it was impossible for the maximum-parsimony and maximum-likelihood methods, respectively, to recover the correct tree. By contrast, use of the LogDet metric enabled neighbor-joining to recover the true tree over a wide range of differences in GC content.
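The scale of these GC differences can be related to the terminal edge length with a small calculation. The Python sketch below assumes the restricted model with a single conditional rate α per edge, under which the expected composition decays exponentially from its starting value toward the equilibrium frequencies at rate α. It uses the case III settings of α = 0.4, a starting GC content of 0.5, and equilibrium GC contents of 0.2 (edges a and d) and 0.8 (edges b and c). The short internal edges are ignored, and the function names are hypothetical.

```python
import math

def expected_gc(t, alpha, gc_start, gc_equilibrium):
    """Expected GC content after time t on one edge (assumed model, see lead-in)."""
    decay = math.exp(-alpha * t)
    return gc_equilibrium + (gc_start - gc_equilibrium) * decay

def gc_difference(t, alpha=0.4):
    """Expected GC difference between sequences B, C and sequences A, D."""
    low = expected_gc(t, alpha, 0.5, 0.2)    # edges a and d lose GC
    high = expected_gc(t, alpha, 0.5, 0.8)   # edges b and c gain GC
    return high - low

for t in (0.3, 0.5, 1.0):
    print(f"t = {t}: expected GC difference = {100 * gc_difference(t):.1f}%")
```

Under these assumptions, the expected difference grows from roughly 7% at t = 0.3 to roughly 20% at t = 1.0 and approaches 60% as t becomes large. This section does not state which values of t the thresholds of 6.7%, 10.7%, and 19.5% correspond to, so the comparison is indicative only.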

Compositional disparities of the scale described above are not uncommon in mitochondrial sequence data, particularly among metazoan phyla (Saccone et al., 1999). An extreme example can be seen in the ATP synthase subunit 6 (ATP6) genes of rainbow trout (Oncorhynchus mykiss) and honeybee (Apis mellifera), which show a 49.6% difference in their GC content. Disparities of an even greater magnitude are seen among bacterial genomes, whose GC contents vary from 25% to 75% (Sueoka, 1964). Thus, this case may be regarded as a fairly realistic example.

Case IV—Rate and Compositional Heterogeneity among Sister Lineages
In this case, the nucleotide sequences were allowed to evolve in a complex manner with the average rates of change along edges a and d set to be 1.6 times those along edges e and f, and the average rates of change along edges b and c set to be 0.4 times those along edges e and f. Concurrently, the nucleotide sequences were allowed to accumulate a lower GC content along edges a and d and a higher GC content along edges b and c. The ancestral sequence (O; Fig. 1) had a uniform nucleotide content. The sequences, therefore, evolved under heterogeneous and nonstationary conditions (i.e., those that many of the phylogenetic methods are not designed to account for).

The combined impact of a decaying historical signal, rate heterogeneity among lineages, and compositional convergence on the ability of phylogenetic methods to infer the correct tree is illustrated in Fig. 8. The proportion of sites supporting one of the two incorrect splits (i.e., AD|BC) already exceeds that supporting the correct split (i.e., AB|CD) when t = 0.3 (Fig. 8), which is a smaller value than those observed in case II (t = 0.4; Fig. 6) and case III (t = 0.4; Fig. 7), so the convergences in substitution rates and in nucleotide compositions have a cumulative effect on our ability to recover the correct tree. This is reflected most strikingly in the results produced by the maximum-parsimony method, which is particularly sensitive to both of these confounding factors. Perhaps not as intuitively, however, the success of the maximum-likelihood method in recovering the true tree is greater than that seen in case III. Its accuracy initially falls almost to zero, but slowly increases with increasing terminal edge lengths. This behavior is probably due to the maximum-likelihood method accounting for the long edge lengths caused by the higher rates in edges a and d, after the base frequencies converge onto their specified equilibrium values. Neighbor-joining using LogDet distances shows a slight decrease in accuracy. The various results collectively imply that there may be grounds for elevated levels of concern when phylogenetic data have evolved under a more complex process of evolution, like the one studied here.


Figure 8 The ability of four phylogenetic methods to infer the correct tree from nucleotide sequences generated by Monte Carlo simulation using the tree in Fig. 1. For this case, we used different Ri matrices (for Ra and Rd, αaxy = αdxy = 0.64, πaA = πaT = πdA = πdT = 0.4 and πaC = πaG = πdC = πdG = 0.1; for Rb and Rc, αbxy = αcxy = 0.16, πbA = πbT = πcA = πcT = 0.1 and πbC = πbG = πcC = πcG = 0.4; and for Re and Rf, αexy = αfxy = 0.4 and πey = πfy = 0.25) to produce data sets with 1000 alignments of 10,000 nucleotides. Figure details and methods of analyses follow Fig. 2.
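To give a feel for how quickly base composition can drift toward the equilibrium frequencies specified for an edge, the following Python sketch may help. It is an illustration only and rests on a loudly stated assumption: an F81-type process in which each site is replaced at rate alpha by a base drawn from the equilibrium frequencies. This may differ from the generating model used in the simulations, and the function names and the time scale are hypothetical.

```python
import math

def composition_after(start, equilibrium, alpha, t):
    """Expected base frequencies after time t.

    ASSUMPTION (not taken from the paper): an F81-type process, for which
    the expected composition relaxes exponentially toward equilibrium:
        f(t) = pi + (f(0) - pi) * exp(-alpha * t)
    Frequencies are given in the order A, C, G, T.
    """
    decay = math.exp(-alpha * t)
    return [pi + (f0 - pi) * decay for f0, pi in zip(start, equilibrium)]

def gc_content(freqs):
    """GC content of a frequency vector ordered A, C, G, T."""
    return freqs[1] + freqs[2]

# Values taken from the caption for edges a and d (case IV).
uniform = [0.25, 0.25, 0.25, 0.25]      # ancestral sequence O
at_rich = [0.4, 0.1, 0.1, 0.4]          # equilibrium on edges a and d
for t in (0.0, 0.5, 1.0, 2.0, 4.0):     # arbitrary time units (hypothetical)
    gc = gc_content(composition_after(uniform, at_rich, 0.64, t))
    print(f"t = {t:3.1f}  expected GC content = {gc:.3f}")
```

Under this assumption the expected GC content falls monotonically from 0.5 toward 0.2 on edges a and d, and the faster edges approach their equilibrium composition sooner than the slower ones.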

 
Case V—Rate and Compositional Homogeneity among Sister Lineages
In this case, the nucleotide sequences were allowed to evolve in a complex manner with the average rates of change along edges a and b set to be 1.6 times those along edges e and f, and the average rates of change along edges c and d set to be 0.4 times those along edges e and f. At the same time, the nucleotide sequences were allowed to accumulate a lower GC content along edges a and b and a higher GC content along edges c and d. The ancestral sequence (O; Fig. 1) had a uniform nucleotide content. The sequences thus evolved under heterogeneous and nonstationary conditions (i.e., those that many of the phylogenetic methods are not designed to account for) that are different from those used in case IV.

Positive estimation biases, introduced by the convergences of rates and of base composition in sister lineages, led to the proportion of sites supporting the correct split (i.e., AB|CD) always exceeding those supporting the incorrect splits (i.e., AC|BD and AD|CB); accordingly, the accuracy of the phylogenetic methods was enhanced (Fig. 9), relative to their performance in the null case (Fig. 2).
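One simple way to tabulate the proportion of sites supporting each of the three possible splits of four taxa is sketched below in Python. This is a minimal illustration, assuming that a site "supports" a split when its pattern is of the parsimony-informative form xxyy, xyxy, or xyyx; the measure of split support used in the study may be defined differently, and the function name is hypothetical.

```python
from collections import Counter

SPLITS = ("AB|CD", "AC|BD", "AD|BC")

def split_support(seq_a, seq_b, seq_c, seq_d):
    """Proportion of sites whose pattern supports each four-taxon split.

    ASSUMPTION: only two-state patterns count as support, i.e. the two
    taxa on each side of the split share a base and the two sides differ.
    """
    lengths = {len(seq_a), len(seq_b), len(seq_c), len(seq_d)}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("sequences must be aligned and non-empty")
    counts = Counter()
    for a, b, c, d in zip(seq_a, seq_b, seq_c, seq_d):
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    n_sites = len(seq_a)
    return {split: counts[split] / n_sites for split in SPLITS}

# Toy alignment (hypothetical data): two sites favor AB|CD,
# one favors AC|BD, one favors AD|BC, and one is constant.
print(split_support("AAGTA", "AACAA", "CGGAA", "CGCTA"))
```

Comparing the three proportions across increasing values of t gives a rough picture of when support for an incorrect split overtakes support for the correct one.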


Figure 9 The ability of four phylogenetic methods to infer the correct tree from nucleotide sequences generated by Monte Carlo simulation using the tree in Fig. 1. In this study, we used different Ri matrices (for Ra and Rb, αaxy = αbxy = 0.64, πaA = πaT = πbA = πbT = 0.4 and πaC = πaG = πbC = πbG = 0.1; for Rc and Rd, αcxy = αdxy = 0.16, πcA = πcT = πdA = πdT = 0.1 and πcC = πcG = πdC = πdG = 0.4; and for Re and Rf, αexy = αfxy = 0.4 and πey = πfy = 0.25) to produce data sets with 1000 alignments of 10,000 nucleotides. Figure details and methods of analyses follow Fig. 2.

 
The maximum-parsimony method recovered the correct tree every time, regardless of the level of sequence divergence; the suite of parameters conducive to this phenomenon has been variously referred to as the "Farris zone" (Siddall, 1998), the "anti-Felsenstein zone" (Bruno and Halpern, 1999), and the "inverse Felsenstein zone" (Swofford et al., 2001). When the evolutionary rates along sister lineages are increased in relation to the rates along the other lineages, the maximum-parsimony method initially outperforms the maximum-likelihood method, as shown in Fig. 9. However, Swofford et al. (2001) discovered that this effect disappears when more sites are added to the alignment.

Based on the Jukes-Cantor (JC) distances, the neighbor-joining and weighted neighbor-joining methods produced the correct phylogeny even for relatively divergent sequences; conversely, when distances based on the LogDet metric were used, the accuracy of the method dropped considerably. As in the previous case, convergences in substitution rates and in base frequencies have a cumulative effect on our ability to recover the correct tree, but the effect is the opposite of that seen in case IV. Again, however, it suggests that we should proceed with caution when phylogenetic data have evolved under a more complex process of evolution, like the one studied in this case.

Case VI—Rate Heterogeneity and Compositional Homogeneity among Sister Lineages
In this case, the nucleotide sequences were allowed to evolve in a complex manner with the average rates of change along edges a and d set to be 1.6 times those along edges e and f, and the average rates of change along edges b and c set to be 0.4 times those along edges e and f. Concurrently, the nucleotide sequences were allowed to accumulate a lower GC content along edges a and b and higher GC content along edges c and d. The ancestral sequence (O; Fig. 1) had a uniform nucleotide content. Hence, the sequences evolved under heterogeneous and nonstationary conditions (i.e., those that many of the phylogenetic methods are not designed to account for) that are different from those used in case IV and case V.

All of the phylogenetic methods were able to infer the correct tree with greater accuracy (Fig. 10) than they could under the null case (Fig. 2), a result attributable in part to the proportion of sites supporting the correct split (i.e., AB|CD) always exceeding those supporting the incorrect splits (i.e., AC|BD and AD|CB).


Figure 10 The ability of four phylogenetic methods to infer the correct tree from nucleotide sequences generated by Monte Carlo simulation using the tree in Fig. 1. In this study, we used different Ri matrices (for Ra, αaxy = 0.64, πaA = πaT = 0.4 and πaC = πaG = 0.1; for Rb, αbxy = 0.16, πbA = πbT = 0.4 and πbC = πbG = 0.1; for Rc, αcxy = 0.16, πcA = πcT = 0.1 and πcC = πcG = 0.4; for Rd, αdxy = 0.64, πdA = πdT = 0.1 and πdC = πdG = 0.4; and for Re and Rf, αexy = αfxy = 0.4 and πey = πfy = 0.25) to produce data sets with 1000 alignments of 10,000 nucleotides. Figure details and methods of analyses follow Fig. 2.

 
For small values of t, the accuracy of the maximum-parsimony method initially declined due to attraction of the two sequences that had evolved at a higher rate (A and D), but as their nucleotide contents became increasingly different from one another, and more similar to those of B and C, respectively, the correct tree was inferred with a rising accuracy that eventually reached 100% at t = 3.2. Obviously, in this case, the convergence in nucleotide composition allowed the maximum-parsimony method to completely overcome the effect of long-edge attraction. The performance of distance-based methods was similar to that in case V, partly due to the concordance of the compositional signal with the correct tree. The sharp drop in the performance of the neighbor-joining method based on JC-corrected distances was due to the fact that, in all alignments, the observed distances between A and D reached values that transcended the expected upper limit for this model. The maximum-likelihood method also performed in a manner reminiscent of its performance in case V.
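The upper limit mentioned above follows from the form of the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which is undefined once the observed proportion of differing sites p reaches 0.75. The Python sketch below illustrates this; returning infinity for such cases is a choice made here for illustration and is not necessarily how the programs used in the study behave.

```python
import math

def p_distance(seq_x, seq_y):
    """Observed proportion of sites at which two aligned sequences differ."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be aligned and non-empty")
    differing = sum(1 for x, y in zip(seq_x, seq_y) if x != y)
    return differing / len(seq_x)

def jc_distance(p):
    """Jukes-Cantor distance for an observed proportion of differences p.

    The correction is d = -(3/4) * ln(1 - 4p/3). For p >= 0.75 the
    argument of the logarithm is zero or negative, so no finite distance
    exists; this sketch signals that case with math.inf (an assumption).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.1, 0.5, 0.7, 0.74, 0.75, 0.8):
    print(f"p = {p:.2f}  JC distance = {jc_distance(p)}")
```

The corrected distance grows without bound as p approaches 0.75, so pairs of highly divergent sequences with dissimilar compositions, such as A and D here, can exceed the range in which the correction is defined.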

The impact of a decaying historical signal, combined with the effects of rate heterogeneity and compositional homogeneity among sister lineages, on the inference of topology and edge lengths is depicted by networks (Fig. 5, case VI). The ratio of the lengths of long edges to internal edges declines as t increases, which is a symptom of substitutional saturation. The internal edge supporting the correct split (i.e., AB|CD) increases in length compared with the remaining internal edges, reflecting the converging base compositions in sister lineages. To a lesser extent, one of the incorrect splits (i.e., AD|BC) also grows in length, due to higher substitution rates along the edges leading to A and D.

As in the previous two cases, convergences in substitution rates and in base frequencies have a cumulative impact on our ability to recover the correct tree, but the effect is different from that seen in case IV and similar to that seen in case V. This result demonstrates again the grounds for even greater concern when phylogenetic data have evolved under a more complex process of evolution, like the one studied in this case.

Case VII—Rate Homogeneity and Compositional Heterogeneity among Sister Lineages
In this case, the nucleotide sequences were allowed to evolve in a complex manner with the average rates of change along edges a and b set to be 1.6 times those along edges e and f, and the average rates of change along edges c and d set to be 0.4 times those along edges e and f. The nucleotide sequences were also allowed to accumulate a lower GC content along edges a and d and a higher GC content along edges b and c. The ancestral sequence (O; Fig. 1) had a uniform nucleotide content. In other words, the sequences evolved under heterogeneous and nonstationary conditions (i.e., those that many of the phylogenetic methods are not designed to account for) that differ from those used in case IV, case V, and case VI.

Maximum-parsimony and neighbor-joining (both weighted and unweighted, using either JC- or LogDet-corrected distances) methods were able to infer the correct tree with greater accuracy (Fig. 11) than they could under the null case (Fig. 2), a result partly attributable to the proportion of sites supporting the correct split (i.e., AB|CD) initially exceeding the proportion supporting the incorrect splits (i.e., AC|BD and AD|CB).


Figure 11 The ability of four phylogenetic methods to infer the correct tree from nucleotide sequences generated by Monte Carlo simulation using the tree in Fig. 1. For this case, we used different Ri matrices (for Ra, αaxy = 0.64, πaA = πaT = 0.4 and πaC = πaG = 0.1; for Rb, αbxy = 0.64, πbA = πbT = 0.1 and πbC = πbG = 0.4; for Rc, αcxy = 0.16, πcA = πcT = 0.1 and πcC = πcG = 0.4; for Rd, αdxy = 0.16, πdA = πdT = 0.4 and πdC = πdG = 0.1; and for Re and Rf, αexy = αfxy = 0.4 and πey = πfy = 0.25) to produce data sets with 1000 alignments of 10,000 nucleotides. Figure details and methods of analyses follow Fig. 2.

 
Initially, the accuracy of the maximum-parsimony method was high due to attraction of the sequences evolving at a higher rate (A and B), but as their nucleotide contents became increasingly dissimilar to each other, and converged with those of D and C, respectively, the correct tree was inferred with declining accuracy. Here, the compositional convergence sent a stronger confounding signal than that provided by rate heterogeneity. The performance of distance-based methods was similar to that in case III because of the dominance of the compositional signal. Neighbor-joining using the Jukes-Cantor model performed poorly, due in part to the confounding compositional signal and in part to the large distances among sequences. The maximum-likelihood method was troubled by compositional convergence at smaller values of t, but from t = 1 its accuracy gradually increased, for reasons that are not entirely clear.

As in the previous three cases, convergences in substitution rates and in base frequencies have a cumulative impact on our ability to recover the correct tree, but the effect is different from that seen in case V and case VI, and similar to that seen in case IV. This result further suggests that when phylogenetic data have evolved under a more complex process of evolution, like the one studied in this case, we have ample reason to be concerned.


    Case Study: β-Tubulin in Metazoa
We present a data set in which nonhistorical signals appear to mislead phylogenetic inference methods. The microtubule element β-tubulin has been widely used in analyses of deep eukaryotic relationships, such as among eukaryotic kingdoms (Baldauf and Palmer, 1993; Edlind et al., 1996; Keeling and Doolittle, 1996; Li et al., 1996; Keeling et al., 2000; Simpson et al., 2002), and among metazoan groups (Schütze et al., 1999). It remains one of the genes of choice for studies of deep divergences, because of its availability for a diverse range of taxa and the relatively high level of sequence conservation at the amino acid level (Keeling and Doolittle, 1996). However, although the inferred trees have been in agreement with evidence from α-tubulin sequences, there are topological inconsistencies when they are compared with trees inferred from other molecular sources (Keeling and Doolittle, 1996).

Nucleotide sequences were obtained from GenBank for the β-tubulin genes of the tobacco budworm (Heliothis virescens; accession number U75868), fruit fly (Drosophila melanogaster; NM_079071), giant Pacific octopus (Octopus dofleini; L10111), mouse (Mus musculus; NM_011655), Norway rat (Rattus norvegicus; AB011679), and Chinese hamster (Cricetulus griseus; AF120325). Sequences were aligned by eye, and variable regions at the beginning and the end of the alignment were removed, leaving a data set of 1338 aligned base pairs (the alignment is available from http://www.bio.usyd.edu.au/~jermiin/). Phylogenetic trees were inferred using the maximum-parsimony method, and maximum-likelihood, neighbor-joining, and weighted neighbor-joining methods using the F84 (Felsenstein, 2002) substitution model.

Analyses of compositional variation at the three codon positions were performed by constructing a neighbor-joining tree from a matrix of Euclidean distances (Fig. 12), which summarized the compositional differences; the method was employed by Lockhart et al. (1994) in their illustration of the LogDet method. Although base frequencies are relatively uniform at first and second codon sites, even when invariable sites are excluded, there is substantial compositional heterogeneity at the third codon sites. In particular, the sequences of Heliothis and Octopus have converged in composition to the exclusion of Drosophila, even though Heliothis and Drosophila are almost certainly sister taxa in the true tree.


Figure 12 Neighbor-joining trees constructed entirely on the basis of compositional differences. Full species names are given in the text. Each entry dij in the distance matrix is the Euclidean distance between compositional vectors of taxa i and j, where the four elements in each vector are the frequencies of the bases A, C, G, and T (Lockhart et al., 1994). The three trees are drawn to the same scale. The third codon sites have the largest compositional difference, with the observed distance between Octopus and Drosophila equal to 0.23 (i.e., 16.2% of the maximum possible difference). The third codon sites from Heliothis and Octopus have converged in composition, particularly in relation to Drosophila.
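The compositional distance described in the caption can be sketched in a few lines of Python. This is a minimal illustration, assuming that gaps and ambiguity codes are simply ignored when base frequencies are computed (the caption does not say how they were treated); the function names are hypothetical.

```python
import math

BASES = "ACGT"

def base_frequencies(seq):
    """Frequencies of A, C, G, and T in a sequence.

    ASSUMPTION: characters other than A, C, G, and T (gaps, ambiguity
    codes) are ignored when computing the frequencies.
    """
    seq = seq.upper()
    counts = [seq.count(base) for base in BASES]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return [n / total for n in counts]

def compositional_distance(seq_i, seq_j):
    """Euclidean distance between the compositional vectors of two taxa."""
    freq_i = base_frequencies(seq_i)
    freq_j = base_frequencies(seq_j)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(freq_i, freq_j)))

# The largest possible value arises when each sequence consists of a
# single, different base: sqrt(1**2 + 1**2) = sqrt(2).
MAX_DISTANCE = math.sqrt(2.0)

print(compositional_distance("AAAA", "CCCC") / MAX_DISTANCE)  # fraction of maximum
```

Applying `compositional_distance` to every pair of taxa, separately for each codon position, yields the kind of distance matrix from which a neighbor-joining tree of compositional differences can be built.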

 
A relative-rates test (Wu and Li, 1985; Muse and Weir, 1992) was conducted using the program K2WuLi (Andrews et al., 1998) to detect rate heterogeneity among the lineages. With the octopus set as the outgroup, large rate differences were found (Table 1), with a long edge leading to Heliothis.



 
Table 1. Application of relative-rates tests for aligned sequences of β-tubulin

 
All of the phylogenetic methods tested (i.e., the maximum-parsimony method, and the maximum-likelihood, neighbor-joining, and weighted neighbor-joining methods using the F84 model of nucleotide substitution) failed to infer what is thought to be the true tree, instead pairing the lepidopteran and octopus together as sister taxa to the exclusion of the fruit fly (Fig. 13). The three species of rodents were correctly identified as constituting a monophyletic group. Even after RY-recoding of third codon sites, which exhibited the greatest degree of compositional heterogeneity, only the weighted neighbor-joining and maximum-likelihood methods succeeded in inferring the expected tree. In this case study, therefore, it was necessary to extricate both the rate and compositional signals before the true tree could be recovered.
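RY-recoding replaces purines (A, G) with R and pyrimidines (C, T) with Y, so that only transversions remain visible at the recoded sites. A minimal Python sketch of recoding the third codon sites is given below; it assumes that the alignment is in frame, with the first column being a first codon position, and the function name is hypothetical.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_CODE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}

def ry_recode_third_sites(seq):
    """Recode third codon positions of an in-frame sequence as R or Y.

    ASSUMPTION: column 1 of the sequence is a first codon position, so
    third positions are columns 3, 6, 9, ... (indices 2, 5, 8, ...).
    Gaps and ambiguity codes at third positions are left unchanged.
    """
    recoded = []
    for index, base in enumerate(seq.upper()):
        if index % 3 == 2:
            recoded.append(RY_CODE.get(base, base))
        else:
            recoded.append(base)
    return "".join(recoded)

print(ry_recode_third_sites("ATGGCCTTA"))
```

Because the recoded third sites carry only two states, differences in the frequencies of individual bases within the purine and pyrimidine classes no longer contribute to the compositional signal, although some historical signal is discarded as well.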


Figure 13 (a) Tree inferred by neighbor-joining using distances under the F84 substitution model (Felsenstein, 2002). Full species names are given in the text. Edges leading to Octopus and Heliothis appear to be experiencing long-edge attraction. Other phylogenetic methods produced the same (incorrect) topology: maximum parsimony, maximum likelihood with empirical base frequencies, optimized transition/transversion ratio, and uniform rates across sites, and weighted neighbor-joining with the F84 model. The scale bar represents average substitutions per site. (b) Tree correctly inferred by weighted neighbor-joining using JC (Jukes and Cantor, 1969) distances, with RY-recoding of third codon sites. The same topology was inferred from the recoded data using maximum likelihood, but not with other methods.

 

    Discussion
The decay of the historical signal is an unavoidable problem when analyzing sequence data, unless the sequences are highly conserved or are sourced from very closely related species. This observation endorses previous warnings that phylogenetic methods will construct trees when requested, irrespective of whether any biologically meaningful information remains in the sequence data, or even if the data are completely random (Steel et al., 1993). However, an efficient and consistent method to determine the reliability of phylogenetic data is currently unavailable, posing an interesting problem, particularly to studies of ancient divergences. Huelsenbeck (1991) proposed that the skew of a tree length distribution might be a good indicator of phylogenetic information, but our research indicates that this approach might be inaccurate when the phylogenetic assumptions are violated; another technique is clearly needed.

The ability of phylogenetic methods to infer the correct tree diminishes when there is a discrepancy between (i) how the nucleotide sequences were generated through evolution and (ii) the nucleotide substitution model assumed in the analysis. Ideally, we would like sequence data to contain only a historical signal, but the reality is that compositional and rate signals are also commonly found in sequence data. The effect of each of these two signals is well known, but the present findings show that the interactions among the different signals in sequence data can be more problematic and cryptic than previously anticipated. The scenarios presented here are more complex than most of those communicated in previous studies, but are nevertheless simplifications of the processes acting upon real sequence data. Other factors, such as among-site rate variation, nonuniform conditional rates of change, and correlated evolution among sites, will undoubtedly affect recovery of the historical signal.

It is difficult to predict the outcome of the interplay between the confounding signals, even when split information has been investigated. This emphasizes the need for examining biases in edge length estimation, due to their fundamental impact on the accuracy of topological inference. In the networks presented above, evidence of the effects of confounding factors can be traced in the lengths of internal edges. It is apparent that support for different topologies makes gradual transitions, rather than abrupt switches between trees, as is seen when the sequence data are fitted to binary trees. The distinction between these two views cannot be made if edge lengths are ignored. That said, we cannot evaluate the phylogenetic tree on the basis of its appearance; instead, we must find ways to assess sequence alignments prior to phylogenetic analysis (a further implication stems from the similarity between some of the phylogenetic signal patterns that are produced through the interaction of confounding factors and those arising from reticulate evolution, which suggests that a thorough survey of the data is required prior to postulating the occurrence of lateral transfer events).

From a more general viewpoint, it appears that there are many estimation biases affecting phylogenetic analysis that are difficult to address, unless they are detected. Simple substitution models, although adequate in many cases, can only approximate real evolutionary processes to a certain degree. Models that are flexible enough to accommodate deviations from traditional assumptions are often parameter-rich, and are limited due to their computational intractability for larger data sets. Consequently, until the advent of greater processing power, we recommend that a rigorous investigation of the sequence data of interest be conducted prior to phylogenetic analysis. Preliminary surveying by compatibility (Jakobsen et al., 1997) and spectral analyses (Hendy, 1993; Lento et al., 1995), for example, can provide some indication of the presence and relative strength of the signals. It should also be noted that, in the presence of conflicting signals, optimality criteria can be preferable to algorithm-based approaches, because inspection of suboptimal trees can provide some insight into the interplay among conflicting signals (Swofford et al., 1996). Nonetheless, blind reliance on optimality criteria is also inadvisable, and it is only when confounding factors have been adequately considered that one can be in a position to make a judgment regarding the analytical approach that should be taken.

The present results have implications for how phylogenetic analyses should be conducted. The influence of the different signals is likely to vary among data sets; for example, some sequences will exhibit substantial disparities in base composition, whereas others may have more uniform nucleotide content. Bearing this in mind, there may not always be a straightforward explanation for why the application of a given method to real data produces a tree that is obviously incorrect. Choosing a phylogenetic method based on personal preference or prior experience is usually not the best approach; it is more appropriate to make an informed decision, based on a survey of the data using methods that can detect evidence of decay in the historical signal and violation of the phylogenetic assumptions.


    Acknowledgment
The authors would like to thank John Robinson, Peter Lockhart, David Penny, and Frédéric Delsuc for helpful comments on this paper. The research was partly funded by a Discovery Grant (DP0453173) from the Australian Research Council. S.Y.W.H. was supported by an A.E. and F.A.Q. Stephens Scholarship from the University of Sydney. This is research paper no. #004 from SUBIT.

Associate Editor: Peter Lockhart


    Notes
3 Present address: Henry Wellcome Ancient Biomolecules Centre, Department of Zoology, University of Oxford, OX1 3PS, United Kingdom; E-mail: simon.ho@zoo.ox.ac.uk


    References

    Andrews T. D., Jermiin L. S., Easteal S. Accelerated evolution of cytochrome b in simian primates: Adaptive evolution in concert with other mitochondrial proteins? J. Mol. Evol. (1998) 47:249–257.

    Baldauf S. L., Palmer J. D. Animals and fungi are each other's closest relatives: Congruent evidence from multiple proteins. Proc. Natl Acad. Sci. U.S.A. (1993) 90:11558–11562.

    Bandelt H.-J., Dress A. W. M. Split decomposition: A new and useful approach to phylogenetic analysis of distance data. Mol. Phylogenet. Evol. (1992) 1:242–252.

    Bruno W. J., Halpern A. L. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. (1999) 16:564–566.

    Bruno W. J., Socci N. D., Halpern A. L. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. (2000) 17:189–197.

    Chang B. S. W., Campbell D. L. Bias in phylogenetic reconstruction of vertebrate rhodopsin sequences. Mol. Biol. Evol. (2000) 17:1220–1231.

    Chang J. T. Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. (1996) 134:189–215.

    Conant G. C., Lewis P. O. Effects of nucleotide composition bias on the success of the parsimony criterion on phylogenetic inference. Mol. Biol. Evol. (2001) 18:1024–1033.

    De Bry R. W. The consistency of several phylogeny-inference methods under varying evolutionary rates. Mol. Biol. Evol. (1992) 9:537–551.

    Edlind T. D., Li J., Visvesvara G. S., Vodkin M. H., McLaughlin G. L., Katiyar S. K. Phylogenetic analysis of β-tubulin sequences from amitochondrial protozoa. Mol. Phylogenet. Evol. (1996) 5:359–367.

    Farris J. S. Likelihood and inconsistency. Cladistics (1999) 15:199–204.

    Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. (1978) 27:401–410.

    Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. (1981) 17:368–376.

    Felsenstein J. PHYLIP (Phylogeny Inference Package), version 3.6a3 (2002) Seattle: Department of Genetics, University of Washington. Available from http://evolution.genetics.washington.edu/phylip.html.

    Fitch W. M. Towards defining the course of evolution: Minimum change for a specific tree topology. Syst. Zool. (1971) 20:406–416.

    Fitch W. M. The estimate of total nucleotide substitutions from pairwise differences is biased. Phil. Trans. R. Soc. Lond. B (1986a) 312:317–324.

    Fitch W. M. An estimation of the number of invariable sites is necessary for the accurate estimation of the number of nucleotide substitutions since a common ancestor. Prog. Clin. Biol. Res. (1986b) 218:149–159.

    Galtier N., Gouy M. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl Acad. Sci. U.S.A. (1995) 92:11317–11321.

    Galtier N., Gouy M. Inferring pattern and process: Maximum-likelihood implementation of a nonhomogenous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. (1998) 15:871–879.

    Gaut B. S., Lewis P. O. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. (1995) 12:152–162.

    Hasegawa M., Hashimoto T. Ribosomal RNA trees misleading? Nature (1993) 361:23.

    Hendy M. D. Spectral analysis of phylogenetic data. J. Classif. (1993) 10:5–24.

    Hendy M. D., Charleston M. A. Hadamard conjugation: A versatile tool for modelling nucleotide sequence evolution. N. Z. J. Bot. (1993) 31:231–237.

    Hendy M. D., Penny D. A framework for the quantitative study of evolutionary trees. Syst. Zool. (1989) 38:297–309.

    Hillis D. M., Huelsenbeck J. P. Success of phylogenetic methods in the four-taxon case. Syst. Biol. (1993) 44:17–48.

    Hillis D. M., Huelsenbeck J. P., Cunningham C. W. Application and accuracy of molecular phylogenies. Science (1994a) 264:671–677.

    Hillis D. M., Huelsenbeck J. P., Swofford D. L. Hobgoblin of phylogenetics. Nature (1994b) 369:363–364.

    Holland B. R., Penny D., Hendy M. D. Outgroup misplacement and phylogenetic inaccuracy under a molecular clock—A simulation study. Syst. Biol. (2003) 52:229–238.

    Huelsenbeck J. P. Tree-length distribution skewness: An indicator of phylogenetic information. Syst. Zool. (1991) 40:257–270.

    Jakobsen I. B., Wilson S. R., Easteal S. The partition matrix: Exploring variable phylogenetic signals along nucleotide sequence alignments. Mol. Biol. Evol. (1997) 14:474–484.

    Jermiin L. S., Ho S. Y. W., Ababneh F., Robinson J., Larkum A. W. D. Hetero: A program to simulate the evolution of nucleotide sequences on a binary tree with four tips. Appl. Bioinf. (2003) 2:159–163.

    Jermiin L. S., Ho S. Y. W., Ababneh F., Robinson J., Larkum A. W. D. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. (2004) 53:638–643.

    Jukes T. H., Cantor C. R. Evolution of protein molecules. In: Mammalian protein metabolism (Munro H. N., ed.) (1969) New York: Academic Press. Pages 21–132.

    Keeling P. J., Doolittle W. F. Alpha-tubulin from early-diverging eukaryotic lineages and the evolution of the tubulin family. Mol. Biol. Evol. (1996) 13:1297–1305.

    Keeling P. J., Luker M. A., Palmer J. D. Evidence from beta-tubulin that microsporidia evolved from fungi. Mol. Biol. Evol. (2000) 17:23–31.

    Kim J. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. (1996) 45:363–374.

    Lake J. A. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proc. Natl Acad. Sci. U.S.A. (1994) 91:1155–1159.

    Lento G. M., Hickson R. E., Chambers G. K., Penny D. Use of spectral analysis to test hypotheses on the origin of pinnipeds. Mol. Biol. Evol. (1995) 12:28–52.

    Li J., Katiyar K., Hamelin A., Visvesvara G. S., Edlind T. D. Tubulin genes from AIDS-associated microsporidia and implications for phylogeny and benzimidazole sensitivity. Mol. Biochem. Parasitol. (1996) 78:289–295.

    Li W., Tanimura M., Sharp P. Rates and dates of divergence between AIDS virus nucleotide sequences. Mol. Biol. Evol. (1988) 5:313–330.

    Lockhart P. J., Howe C. J., Bryant D. A., Beanland T. J., Larkum A. W. D. Substitutional bias confounds inference of cyanelle origins from sequence data. J. Mol. Evol. (1992) 34:153–162.

    Lockhart P. J., Larkum A. W. D., Steel M. A., Waddell P. J., Penny D. Evolution of chlorophyll and bacteriochlorophyll: The problem of invariant sites in sequence analysis. Proc. Natl Acad. Sci. U.S.A. (1996) 93:1930–1934.

    Lockhart P. J., Steel M. A., Hendy M. D., Penny D. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. (1994) 11:605–612.

    Mooers A. O., Holmes E. C. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. (2000) 15:365–369.

    Muse S. V., Weir B. S. Testing for equality of evolutionary rates. Genetics (1992) 132:269–276.

    Penny D., Hendy M. D., Steel M. A. Progress with methods for constructing evolutionary trees. Trends Ecol. Evol. (1992) 7:73–79.

    Phillips M. J., Lin Y. H., Harrison G. L., Penny D. Mitochondrial genomes of a bandicoot and a brushtail possum confirm the monophyly of australidelphian marsupials. Proc. R. Soc. Lond. B (2001) 268:1533–1538.

    Pol D., Siddall M. E. Biases in maximum likelihood and parsimony: A simulation approach to a 10-taxon case. Cladistics (2001) 17:266–281.

    Rosenberg M. S., Kumar S. Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference. Mol. Biol. Evol. (2003) 20:610–621.

    Saccone C., Giorgi C. D., Gissi C., Pesole G., Reyes A. Evolutionary genomics in Metazoa: The mitochondrial DNA as a model system. Gene (1999) 238:195–209.

    Saitou N., Nei M. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. (1987) 4:406–425.

    Schütze J., Krasko A., Custodio M. R., Efremova S. M., Müller I. M., Müller W. E. G. Evolutionary relationships of Metazoa within the eukaryotes based on molecular data from Porifera. Proc. R. Soc. Lond. B (1999) 266:63–73.

    Siddall M. E. Success of parsimony in the four-taxon case: Long-branch repulsion by likelihood in the Farris zone. Cladistics (1998) 14:209–220.

    Simpson A. G. B., Roger A. J., Silberman J. D., Leipe D., Edgcomb V. P., Jermiin L. S., Patterson D. J., Sogin M. L. Evolutionary history of "early diverging" eukaryotes: The excavate taxon Carpediemonas is a close relative of Giardia. Mol. Biol. Evol. (2002) 19:1782–1791.

    Steel M. A. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. (1994) 7:19–23.

    Steel M. A., Lockhart P. J., Penny D. Confidence in evolutionary trees from biological sequence data. Nature (1993) 364:440–442.

    Sueoka N. Compositional variation and heterogeneity of nucleic acids and protein in bacteria. In: The bacteria, volume V: Heredity. Gunsalus I. C., Stanier R. Y., eds. (1964) New York: Academic Press. Pages 419–443.

    Swofford D. L., Olsen G. J., Waddell P. J., Hillis D. M. Phylogenetic inference. In: Molecular systematics. Hillis D. M., Moritz C., Mable B. K., eds. (1996) Sunderland, Massachusetts: Sinauer Associates. Pages 407–514.

    Swofford D. L., Waddell P. J., Huelsenbeck J. P., Foster P. G., Lewis P. O., Rogers J. S. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. (2001) 50:525–539.

    Tarrío R., Rodríguez-Trelles F., Ayala F. J. Shared nucleotide composition biases among species and their impact on phylogenetic reconstructions of the Drosophilidae. Mol. Biol. Evol. (2001) 18:1464–1473.

    Tateno Y., Takezaki N., Nei M. Relative efficiencies of the maximum-likelihood, neighbor-joining, and maximum-parsimony methods when substitution rate varies with site. Mol. Biol. Evol. (1994) 11:261–277.

    Wu C.-I., Li W.-H. Evidence for higher rates of nucleotide substitution in rodents than in man. Proc. Natl Acad. Sci. U.S.A. (1985) 82:1741–1745.

    Yang Z. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.

    Yang Z., Roberts D. On the use of nucleic acid sequences to infer early branches in the tree of life. Mol. Biol. Evol. (1995) 12:451–458.

    Zharkikh A., Li W.-H. Inconsistency of the maximum-parsimony method: The case of five taxa with a molecular clock. Syst. Biol. (1993) 42:113–125.

