Skip Navigation

Systematic Biology 2007 56(1):68-82; doi:10.1080/10635150601175578
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ruano-Rubio, V.
Right arrow Articles by Fares, M. A.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Ruano-Rubio, V.
Right arrow Articles by Fares, M. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 Society of Systematic Biologists

Artifactual Phylogenies Caused by Correlated Distribution of Substitution Rates among Sites and Lineages: The Good, the Bad, and the Ugly

Edited by Tim Collins: Associate Editor

Valentin Ruano-Rubio1,2 and Mario A. Fares1,2

1 Molecular Evolution and Bioinformatics Laboratory, Department of Biology, National University of Ireland Maynooth, Ireland


    Abstract
 Top
 Notes
 Abstract
 Methods
 Results and Discussion
 Remarks
 Appendix
 Acknowledgments
 REFERENCES
 
Despite the advances in understanding molecular evolution, current phylogenetic methods barely take account of a fraction of the complexity of evolution. We are chiefly constrained by our incomplete knowledge of molecular evolutionary processes and the limits of computational power. These limitations lead to the establishment of either biologically simplistic models that rarely account for a fraction of the complexity involved or overfitting models that add little resolution to the problem. Such oversimplified models may lead us to assign high confidence to an incorrect tree (inconsistency). Rate-across-site (RAS) models are commonly used evolutionary models in phylogenetic studies. These account for heterogeneity in the evolutionary rates among sites but do not account for changing within-site rates across lineages (heterotachy). If heterotachy is common, using RAS models may lead to systematic errors in tree inference. In this work we show possible misleading effects in tree inference when the assumption of constant within-site rates across lineages is violated using maximum likelihood. Using a simulation study, we explore the ways in which gamma stationary models can lead to wrong topology or to deceptive bootstrap support values when the within-site rates change across lineages. More precisely, we show that different degrees of heterotachy mislead phylogenetic inference when the model assumed is stationary. Finally, we propose a geometry-based approach to visualize and to test for the possible existence of bias due to heterotachy.

Keywords: Covarions; heterotachy; maximum likelihood; phylogenetic bias; rate-across-site; systematic error

Received November 2, 2005; Revised January 6, 2006; Accepted August 25, 2006


There are several factors that may hamper the inference of reliable phylogenies, including insufficient data and oversimplified models. At best, these model deficiencies may lead to random noise that does not result in erroneous tree topologies. However, correlated parameters unaccounted for in the tree inference model can result in systematic biases towards highly supported erroneous topologies, branch lengths, and other parameter values.

Typically, probabilistic models proposed for the molecular evolution at a site are based on a discrete Markov process where each state represents a different character value (a nucleotide or amino acid) and transitions between states represent a site change fixation. Early methods assumed an identical and independent evolution process for all sites. Uzzell and Corbin (1971) accounted for the evolutionary heterogeneity among sites using rate-across-site (RAS) models. In these RAS models, site-specific relative rates are tied under a parametric sampling distribution, usually a gamma {Gamma} ({alpha},β) distribution. Despite their usefulness to approximate empirical data, RAS models are simplistic, as they do not take into account changes in the functional, environmental, and selective constraints acting on amino acid sites of a protein over time.

Covarion-like models were among the first to account for variations in evolutionary constraints among lineages. Fitch and Markowitz (1970) coined the term covarion, but covarion-like models were not given much attention until recently (e.g., Tuffley and Steel, 1998; Lopez et al., 1999; Galtier, 2001; Penny et al., 2001; Huelsenbeck, 2002; Kolaczkowski and Thornton, 2004; Philippe et al., 2005; Ane et al., 2005). Under these models, two separate Markov processes describe molecular evolution. The first is the typical transition between character states (amino acids for protein, nucleotides for DNA), whereas the second allows sites to switch between on-off states. Sites in the off state will remain constant, whereas those in the on state accept changes following the former process. This covarion model can explain an overall RAS-like heterogeneity, as every site relative rate would depend on the time interval it has spent in the on state; the longer the time, the higher the rate. It also justifies the presence of extensive heterotachy, or changing rates, for a particular site across lineages that cannot be well explained by RAS models (Lockhart et al., 1998; Lopez et al., 2002; Philippe et al., 2005; Ane et al., 2005).

Extensions to this initial covarion model may be proposed to convey more accurately the complexity present in real data with a limited set of parameters. For example, Galtier (2001) considered a discrete gamma [{Gamma} ({alpha})] model that adds a fixed instantaneous rate of change, rate-of-rates (v), between rate categories for each lineage on every site. This extended covarion model does not allow, however, for rate-of-rates change between lineages that may be responsible of prsducing phylogenetic biases.

Currently, most phylogenetic studies use RAS models on simplicity and software availability grounds. Simultaneously, some methods have been proposed to determine significant within-site changes in evolutionary rates across lineages (Gu, 1999, 2001; Knudsen and Miyamoto, 2001; Mooers and Holmes, 2000; Susko et al., 2002). Other authors have carried out work on topologically misleading effects with special focus on shared invariant sites (Lockhart et al., 1998; Susko et al., 2004), on highly divergent or even random sites (Brinkmann and Philippe, 1999; Pisani, 2004; Susko et al., 2005), or without restriction to any of those two extremes (Chang, 1996; Huelsenbeck, 1998; Inagaki et al., 2004; Kolaczkowski and Thornton, 2004).

Deviations from the RAS assumptions may lead to systematic errors. Chang (1996) showed that taxa tend to cluster together in a phylogeny when they display similarities in the distribution of rates among sites. Susko et al. (2004) reported that distance correction to account for heterogeneity in distance-based methods might become a one-way trip into long-branch attraction artifacts. Their results pinpoint a tendency towards long-branch attraction or repulsion, even using maximum likelihood (ML) methods, when the model and parameter values induce systematic misestimation of genetic distances. A covarion-like evolving data set, as those described by Tuffley and Steel (1998) or Galtier (2001), does not necessarily lead to an incorrect phylogeny. In fact, true sister groups may well present a greater correlation in their relative site rates and hence tend to be clustered together in the phylogeny. Heterogeneity in the rate-of-rates parameter values among lineages may however lead to mistaken phylogenetic groupings, as we will suggest in this study.

Here, we first investigate possible sources of bias from assuming RAS models in applying the maximum likelihood criterion on data that evolved following a covarion-like model of evolution. More precisely, we focus on evolutionary scenarios where lineages differ in the assignation of rate category per site. Then we propose a tool to visualize and test the presence of such biases.


    Methods
 Top
 Notes
 Abstract
 Methods
 Results and Discussion
 Remarks
 Appendix
 Acknowledgments
 REFERENCES
 
Simulation Model
To reproduce the bias caused by changes in evolutionary constraints, we used a similar approach as that used in previous work (Chang, 1996; Kolaczkowski and Thornton, 2004; Spencer et al., 2005; Philippe et al., 2005). However, we introduced important differences regarding the phylogeny used in simulations: the tree comprised a quartet of subtrees rather than a quartet of taxa, we considered a set of coefficients that account for the conservation of the relative site rate categories at the deepest divergence events (internal nodes) of the tree; we also accounted for possible differences in the shape parameter of the gamma RAS distribution of substitutions among subtrees. In this study we did not consider the effect of bipartitions of the data matrix based on different sets of branch lengths (see Fig. 2) as done in other studies; however, heterotachy is present due to random independent changes in relative rate category per site or difference in rate distributions across lineages.

We focused on unrooted phylogenies presenting four well-resolved monophyletic groups (Fig. 1). Each group is composed of a number of taxa and evolves under a standard RAS model. The shape of each subtree is controlled by three parameters (Fig. 1c): the total height of the tree, h, the number of simultaneous divergence events along every lineage, d (thus the total number of leaves is 2d), and the relative branch length ratio, r, that modulates the temporal spacing between divergence events. The motivation behind considering subtrees rather than the typical unrooted four-taxa phylogeny is twofold: first, to provide enough data to infer back the rate for a site on each subtree and therefore its relative rate on that lineage, and second, to record the influence of different taxa sampling contexts on the bias strength.


Figure 1
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 Covarion-like model used for the study. (a) Site relative rate categories may change across the two innermost nodes of the tree. Conservation coefficients {theta}n{alpha}, {theta}nβ, and {theta}n{gamma} represent the rate category transition probabilities between the node (n) and each branch and beyond. (b) The resulting overall model is a quartet, with four resolved RAS processes and six conservation coefficients. We collapsed coefficients at the inner branch into one, {theta}ab, being {theta}ai and {theta}ib equal to {theta}ab1/2 for simplicity. (c) Each subtree shape is determined by three parameters: h indicates the total height of the tree, r the ratio between branch length after and before each divergence event, and d the number of divergence events from root to leaves. This figure illustrates a subtree with d = 3, therefore with 8 taxa. The branch length nuisance variable l is set so that the sum l + lr + lr2+ ... + lrd matches the total height h.

 
We modeled heterogeneity among sites using a discrete approximation to {Gamma} ({alpha}) rate distribution with the number of rate categories C = 16 on all branches. Initially we considered an RAS model with each site belonging to the same rate category throughout the tree. However, we included the possibility of changing the rate categories per site instantaneously on the two most internal nodes or divergence events, which led to the four well-supported subtrees. On both nodes we defined three transition probability matrices Pn{alpha}, Pnβ, and Pn{gamma}, one per each incident branch. Each matrix element pij indicates the probability for a site with category i on the node to change to category j that will apply to the incident branch and beyond. As we constrained all branch rates to follow a discrete gamma distribution, all matrices must yield equally probable stationary category frequency vectors F = {C–1}C. We further constrained the family of valid matrices in order to be able to parameterize the change in the distribution of rates across the tree with a single tuning value, {theta}, which we used for the site rate category conservation coefficient throughout the text (Fig. 1a, Fig. 1b). Its value indicates the proportion of sites whose rate category is unchanged at the divergence event, whereas (1 – {theta}) indicates the percentage of sites that is susceptible to change to another rate category. In the latter case, the new category will be selected at random from all C possible categories including its current value. Thus the resulting transition matrix is:


Formula 1

(1)
where:


Formula 2

(2)

Under this restriction, reversion and transitivity are trivial because Pxy is symmetric:


Formula 3

(3)
where z is reachable from x through node y. We based our simulations on a JTT (Jones et al., 1992) Markov process. Nonetheless, qualitative conclusions are extensible to other evolutionary models.

Simulation Cases
Due to computational limitations we only investigated covarion bias effects on three different parametric scenarios: (a) presence of changes in rate categories that are equally frequent in nonsister subtrees, (b) presence of changes in rate categories in the inner branch connecting phylogenetic pairs of subtrees; and (c) existence of different site rate sampling distributions. For all the simulations conducted here, we produce amino acid data sets following the model explained above based on a JTT substitution matrix. Each parameter values combination was repeated 50 to 100 times to reduce stochasticity. Unless otherwise indicated, alignments were 1000 amino acids long. We utilized the program CODEML from the PAML package version 3.14 (Yang, 1997) for evaluating the phylogenetic support based on RELL bootstrap proportions (Kishino and Hasegawa, 1989). For each data set we compared the support obtained by the three unrooted possible trees that were constrained so that all four subtrees are monophyletic and their topologies were known. Parameter values were specific for each simulation batch. All diagrams plot mean RELL support index among repetitions. Notice that this is not the same as considering the percentage of cases where that tree topology is the one that maximizes the likelihood, which is typically the statistic reported in this kind of studies. We used the R statistical package to produce all diagrams and specifically the locfit module to fit surfaces for the three-dimensional plots, and geneplotter for the triangular plots in Figure 10.

Geometrical Visualization of Bias
Intuitively, sites that may introduce bias towards a particular hypothesis (topology) are those for which the two non-sister lineages have similar rates and different from the other two lineages (Fig. 2). Therefore, assuming that we have enough data to accurately estimate rates per site per lineage, we can classify sites according to the hypothesis they favor. We expect that a proportion of sites support a particular topology due to stochastic events alone. However, their overall effect must cancel out and result in an unbiased sample.


Figure 2
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 A simple bipartition heterogeneous model that produces biased phylogenies. A percentage, p, of sites evolves under a set of branch lengths, whereas the rest does so under a complementary set of branch lengths. Non-sister-group site relative rates are positively correlated, whereas real phylogenetic groups are negatively correlated. This conflicts with an RAS evolutionary process where relative branch lengths must be kept constant across all sites.

 
There are three possible topologies for a fully resolved four-subtree–based unrooted phylogeny. A simple way to represent graphically these hypotheses is using the equidistant vertices of a regular triangle (Fig. 3). Others have used triangle geometry previously in a phylogenetic context although with a different purpose (Strimmer and von Haeseler, 1997; Kim, 2000). Within such a layout, impartial sites (those not favoring a particular hypothesis) will fall close to the middle point, the barycenter, whereas partial sites will accumulate towards peripheral areas close to a vertex, depending on which hypothesis or hypotheses they do favor.


Figure 3
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 Triangular layout used for representing bias visually. The three equidistant vertices represent the three possible unrooted topologies. Sites that show same relative rates for all subtrees should fall close to the barycenter of the triangle, whereas different degrees of site rate correlation between subtrees will be highlighted by the points located closer to one or two of the topologies represented by the triangle vertices.

 
In order to compare rates per site between different subtrees, we used empirical Bayesian estimates as a proxy (Mayrose et al., 2004). We calculated these for each subtree independently. At this point we assume a Poisson model (Juke and Cantor, 1969) rather than the actual JTT rate-matrix because using an empirical matrix for low evolving sites would yield rates highly dependent on site composition. This dependence tends to cluster groups of taxa that share a greater number of constant or nearly constant sites due to the fact they are slow-evolving lineages in cases that combine subtrees with pronounced height differences. Here it is not as critical to obtain accurate estimators as to produce a reasonable order that is not sensitive to the conserved or convergent site composition. For each subtree we ranked sites by their rate estimator, resolving ties by randomization. Thus, Rx(i), denotes the rank for site i on the subtree x taking values between 0 and 1 in steps of 1/(data matrix length).

Plotting Individual Sites
Under the null hypothesis (RAS model) all subtrees should consistently have equal relative rates at each site across lineages; that is, a fast-evolving site on a lineage has to be proportionally fast evolving in all other lineages. In other words, if sites on certain group are ranked based on their individual rates on that lineage, independent ranking performed on other parts of the phylogeny should yield a similar order (positively correlated with the former) depending on the variance of the rate estimator used. Nonetheless, in non-RAS processes, as the covarion process described above, within-site rate changes break to certain degree this site-by-site correspondence between relative site rates among lineages and their ranking. Lineages with more similar assignment of rate categories along sites should yield higher correlated independent site rankings.

Based on previous observations, lineages with more similar evolutionary processes should tend to attract each other. In this context, we may say that a site "prefers" a hypothesis [(x,y),v,w] if for that site, ranks on lineages x, y and v, w are closer within pairs and different between pairs. A way to reflect this in quantitative terms is that a site will prefer a particular hypothesis if |Rx(i) – Ry(i)| and |Rv(i) – Rw(i)| are small (close to zero) and the average AVG(|Rx(i) – Rv(i)|, |Rx(i) Rw(i)|, |Ry(i) – Rw(i)|, |Ry(i) – Rv(i)|) is big (close to one).

Instead of determining individual support indices for each site and hypothesis, we calculated coefficients that give to each hypothesis a weight proportional to the similarity of ranking of their two subtree pairs. Under this approach we yield three weight values, with only two degrees of freedom.

Let us consider the three hypotheses:


Formula 4

(4)

For each site i we calculated the value F that measures the amount of relative rate change between two subtrees (x and y) as:


Formula 5

(5)

Small differences in rates (rate conservation or convergence) will yield values close to 1, whereas extreme differences should yield F value closer to 0. F values are directly comparable only if they are estimated from subtrees with equivalent shapes (topology and branch lengths) as distributions of rate estimators used to rank sites depend on these subtree shapes. Therefore, in order to normalize ranking differences between all subtrees, we performed parametric simulations (using 1000 replicates) for each rate category based on RAS model assumptions. In these simulations, we used the parameter estimators obtained by ML analysis performed in PAML. Then we substituted the F values at each site by the complement of their corresponding cumulative probability F' assuming that its rate category is the one with maximum posterior probability. This will yield uniformly distributed values from 0 to 1.


Formula 6

(6)

Then we measure the total amount of change on a site i for each hypothesis with a second function, G, defined as the product of the indices for its two individual phylogenetic subtree pairs:


Formula 7

(7)

The resulting hypothesis indices also vary from 1 (total conservation of the substitution rate categories per site within pairs) to 0 (extreme change of the substitution rate categories per site at least within one pair). Then, we defined the weight, {omega}, on site i associated to a particular hypothesis j as:


Formula 8

(8)

Finally, we used these weights for combining arithmetically the three triangle vertices (hypotheses) in order to plot a single point that represents the site within the triangle. The bidimensional location (pi) of that point can be calculated as:


Formula 9

(9)

Here hj stands for the two-dimensional coordinates of the vertex that represents hypothesis Hj. Notice that we can implement many alternative G or F functions generating different weight coefficients and plots with different statistical properties. Nonetheless, any alternative must be designed as to allow impartial sites to concentrate around the center of the triangle, whereas partial sites must cluster towards the corners (one hypothesis wins) or the edges (two draw and one loses).

Bias Test and Measurement
We can evaluate the assumptions in the RAS models by testing the goodness-of-fit of the expected to the observed distribution densities of the points in the triangle. Even though this may spot nonstandard samples, it would not automatically indicate which hypothesis benefits from that deviation. Besides, some samples may still be balanced and yet have significantly different plot densities. Alternatively, we could take a look at the mass center of the plot, that is, the arithmetic mean of all site points.

In the ideal case of equivalent subtrees with the same shape (topology and branch lengths), and assuming a genuine RAS evolution model, the mass center must be placed on the triangle barycenter equidistant from each tree hypothesis (topology) regardless of the presence or lack of any phylogenetic signal. Asymmetric heavy tails must pull this mean towards the winning alternative hypothesis (or further from the losing one). In consequence, one could use this to infer whether the sample is biased with respect to one or more hypotheses. Provided that the distribution of the mean plot point position approaches a bivariate normal distribution, we can readily obtain a simple formula to calculate a confidence area for the sample mass center as developed in the Appendix:


Formula 10

(10)
where D designates the Euclidean distance from the sample mean to the barycenter, {sigma}x is the standard deviation of the sample's projection on any axis of choice with origin in the barycenter, and L is the sample size (data matrix length).

Nonetheless, distinct branch lengths and different number of taxa may affect the distribution of dots on the plot and change the mass center. This is due to the fact that the G values for each hypothesis, as described in Methods, are not independent from each other because they combine shared subtree estimators. In practice, however, the test starts to become liberal just with a great amount of data (10,000 sites) and severe tree shape height asymmetry (e.g., when trees present subtrees over 8 times taller, or with 4 times greater number of taxa, than others). A general solution to this issue is to further simulate the expected site plot and mass center, given the ML estimators of the full phylogeny, and to use the distance to this alternative center. Even though Equation (10) is still a good approximation under such circumstances, sample mean comparison tests, such as the Wilcoxon test, are probably more appropriate.

Another aspect to bear in mind is that since we sort rank ties by randomization, each execution can potentially return a different P-value. In order to stabilize the outcome, the test must be repeated a number of times and an averaged P-value must be selected. However, under this procedure the test becomes somewhat conservative. In this work we will proceed without repeats, as results for single data sets are not critical in the simulations.

As an alternative to the null hypothesis presented above, data sets can show different degrees of relative site rate correlation among lineages that may cause systematic distortions in the resulting topology and branch estimators. Interpretation of the results of such a test is independent of the amount of actual phylogenetic signal (inner branch length). Therefore, results can only point to the possibility of having inferred an artifactual phylogeny by ML under a RAS model.

The simulation and test programs were implemented in Java programming language utilizing functionality from the Java standard library, PAL library (Drummond and Strimmer, 2001), and Jakarta commons. We delegated on rates4site software (Mayrose et al., 2004) to obtain empirical Bayesian site rate estimates per each subtree.


    Results and Discussion
 Top
 Notes
 Abstract
 Methods
 Results and Discussion
 Remarks
 Appendix
 Acknowledgments
 REFERENCES
 
We investigated the effect of inducing different levels of site rate correlation between subtrees using the simulation model based on four monophyletic groups of taxa a1, a2, b1, and b2 as described in Methods and Figure 1. We intuitively expected that lineages with a more similar evolutionary process would attract each other due to a greater concordance on the rates across sites, or alternatively that lineages with more divergent evolutionary processes would repel accordingly. Higher pairwise site rate category conservation coefficients {theta}s (product of individual {theta} along connecting nodes) should cluster lineages with increasing confidence as more data becomes available.

The Bad
In the first simulation, we fixed {theta}ab (Fig. 1b) on the inner branch connecting the a and b groups to 1 so that there is no intrinsic rate category shuffle along this branch. {theta}aa1 = {theta}bb1 = {theta}1 and {theta}aa2 = {theta}bb2 = {theta}2 were assigned arbitrary values between 0 and 1 in steps of 0.1 (Fig. 4a). Initially, for each of these cases we fixed all subtrees total heights, h, to a common average of one change per site, resulting in a symmetric and molecular clock-compatible phylogeny. The shape parameter value, {alpha}, was also set to 1 (i.e., a moderate to high degree of substitution rate heterogeneity among sites). Subtrees were composed of four taxa; the result of two rounds of duplication or speciation events on all lineages (d = 2). Each branch was twice as short as the one that leads to the parent node (r = 2). The inner branch was set to 0.04 changes per site. This last value was chosen so that our simulated data sets are forced towards the inconsistency parametric edge. Consequently, relatively small changes in parameter values can make the results wander from yielding the right topology to the wrong one with increasing probability as the size of the data set increases. Under these conditions, the bootstrap support for the right topology following an RAS model of evolution was 80% out of 100 replicates.


Figure 4
View larger version (62K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 Plot of the mean RELL support value for the wrong topology ((a1, b1), a2, b2). (a) Simulated process: the conservation coefficient on the inner branch was fixed to 1 and the same values were assigned to non-sister subtrees: {theta}1 for a1 and b1, and {theta}2 for a2 and b2, respectively. Tree heights are set to 1 substitution per site, whereas the innermost branch length is set to 0.04. There are a total of four taxa per each group (d = 2) and each branch is twice as long as the next branch towards the tree leaves (r = 2). Rates are drawn from a gamma distribution {Gamma} ({alpha}) for all subtrees. (B) A plot of the mean RELL support value (z-axis) and the conservation coefficients {theta}1 (x-axis) and {theta}2 (y-axis). For this particular set of parameters, at very low conservation distance, |{theta}1{theta}2|, the mean RELL value approaches a constant low support value (expected under a RAS model), whereas it increases considerably when the difference is extreme (= 1).

 
A possible biological interpretation of such an arrangement is that smaller {theta} s indicate a greater dissimilarity between evolutionary constraints on different lineages. For instance, paralogs within a gene family with distinct functions would show a similar degree of disparity in within-site evolutionary rates. If {theta}1 = {theta}2, there is the same degree of covariance between all subtrees. Consequently, we expect to obtain a roughly random type of error with no systematic bias. However, if two lineages have a higher degree of rate category conservation; for example, when the two paralogous groups of sequences are functionally redundant, there will be a higher correlation of site rates and a stronger attraction under noncovarion models between these two lineages. The opposite can be also true for the two lineages with lower site rate category conservation. Thus we would expect to see wrong topology resolution as the difference between {theta} 1 and {theta} 2 increases. In fact, our simulations, based on a symmetrical phylogeny where all subtrees have the same height, show that such an increase leads from reasonable support for the true topology towards false support for the hypothesis in which lineages with greater site rate conservation appear together (Fig. 4b).

Further, we took a particular combination of {theta} values ({theta}1 = 1, {theta}2 = 0.2) and explored the effect of changing values for key characteristics of the input phylogeny at different data sets sizes (Fig. 5). As expected, the increase of the amount of phylogenetic signal (inner branch length) neutralizes systematic error effects. On the other hand, the increase of subtree height does reinforce systematic error because of the loss of phylogenetic signal in the innermost branch. Covarion bias relies on the existence of heterogeneity of rates among sites because no site rate categories would be considered otherwise. Accordingly, a lower shape parameter value (more severe heterogeneity) favors resolution for the wrong topology, whereas a higher value does just the opposite. Besides, proper taxon sampling may reduce the effect of bias. In fact, adding more taxa reduces the bias strength. Short deep and long-tip branches seem to be less problematic than long deep and short-tip branches are. Thus, we can conclude that the characteristics of the evolutionary process at the subtrees have a remarkable effect on phylogeny recovery because long deep branches seem to have a greater impact, something that is in agreement with previous studies (Kim, 1996, 1998; Bremer et al., 1999).


Figure 5
View larger version (100K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5 Effect that changes in simulation parameters have on the RELL support for the three possible topologies. We investigate the marginal effect of five different parameters (one per each diagram column), fixing all other parameters to the values used in the surface plot in Figure 4: inner branch length {0.01, 0.02, 0.04, 0.08, 0.16}, shape parameter ({alpha}) {0.05, 0.25, 0.5, 1, 1.5, 2, 5}, subtree common height (h) {0.25, 0.5, 0.75, 1, 1.5, 2}, branch length factor (r) {0.25, 0.5, 1, 2, 4, 8}, and number of taxa per subtree (d2) {1, 2, 4, 8, 16}. We tested these changes on a particular combination of conservation coefficients {theta}1 = 1 and {theta}2 = 0.2 situated at an average height on the plotted slope in Figure 4. These values are close to the inconsistency critical region. For each combination of values, we took the mean RELL support of 50 replicates for all three possible topologies. We performed the experiment for a number of different alignment lengths, although here we show results just for 1000, 3000, and 10,000 sites. There are no data for 10,000 sites and 16 taxa per subtree because it was computationally prohibitive.

 
Finally, within-site rate heterogeneity among lineages can also alter the outcome significantly. Among all alternative combinations studied, it is only the ones that elongate or shorten lineages together with the same {theta} values that have a pronounced effect in comparison with the results obtained with the symmetric settings (Fig. 6). In fact, bias increases when lineages with high {theta} values are accelerated, whereas it decreases when lineages with low {theta} values accumulate a greater number of changes. We also studied the effect that different numbers of taxa among subtrees has on producing such phylogeny biases. Results here are qualitatively parallel to the ones obtained by tree height asymmetry, with accelerated lineages and more conserved lineages having equivalent effects to poorly sampled and well-sampled lineages, respectively. This, together with the results presented in Figure 5, suggests that long deep branches (caused by rate acceleration, poor sampling or high tree branch length ratio values) may have a major role in catalyzing phylogenetic bias.


Figure 6
View larger version (104K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 6 Effect of subtree height asymmetry on the covarion bias strength. Here we investigate the effect of five distinct relative subtree height rearrangements. We simulated data sets of variable data lengths, although here we show results just for 1000 and 10,000 sites alignments. As in Figure 5, groups a1 and b1 share conservation coefficient {theta}1 = 1, whereas a2 and b2 share a different and much lower value {theta}2 = 0.2. In order to create variability in subtree heights, we study five different assortments where two or tree subtrees share the same height (h) and the remaining share the inverse (h–1). h = 1 means the original symmetric tree that becomes more asymmetric as h approaches 0 or infinite.

 
The Good
For the second simulation scenario, we allowed {theta}ab to vary from 0 to 1 in steps of 0.1, while we fixed the {theta} for each external branch to the same value {theta}12 taking the same set of values as formerly (Fig. 7a). The tree shape is the same as for "the bad" case, although here we set up the internal branch to 0.01 to hamper the recovery of the correct topology (48% bootstrap out of 100 repeats). Intuitively, simplistic assumptions may actually assist in obtaining the right topology, as closely related taxa should more often share evolutionary traits not accounted for by the model. Accordingly, decreasing {theta}ab (no rate category conservation across the inner branch) and increasing subtree {theta} s (conservation between sister groups) yields high support for the true topology (Fig. 7b) despite the limited phylogenetic signal available.


Figure 7
View larger version (93K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 7 Plot of the mean RELL support value for the right topology ([a1, a2],b1,b2) with limited phylogenetic signal (inner branch length = 0.01). (a) In this simulation, the inner branch had a variable conservation {theta}ab, taking values from 0 to 1 in steps of 0.1, whereas the branches leading to each one of the subtrees share the same conservation coefficient {theta}12 that takes the same set of values. Under this procedure, the coefficient between sister subtrees is {theta}12 whereas between non-sister subtrees is {theta}ab· {theta}12. (b) Plot that illustrates the effect of the difference between conservation coefficients on the mean RELL value.

 
Although this may seem to be preferable for obvious reasons, we may well want to have an objective measure of the evidence towards each hypothesis. This same reasoning was used to debunk the arguments concerning the superiority of parsimony over likelihood in the so-called Farris zone (Siddall et al., 1998; Swofford et al., 2001), which is part of the long-standing literature battle regarding long-branch attraction effects and the performance of maximum likelihood and parsimony (e.g., Sullivan and Swofford, 2001; Kolaczkowski and Thornton, 2004; Gadagkar and Kumar, 2005; Spencer et al., 2005; Gaucher and Miyamoto, 2005; Brinkmann et al., 2005).

The Ugly
In the third simulated scenario, we explored the effect of distinct shape parameter values under different degrees of site rate category conservation. Two pairs of phylogenetic groups evolved under a gamma RAS model, with the first pair of subtrees (a1, b1) using a shape parameter, {alpha}1, and the other two subtrees using a different shape parameter, {alpha}2. For the inner branch, we used one of the pair values, {alpha}ab = {alpha}1 (Fig. 8b) and an interpolated version of both values, {alpha}ab = exp [ln({alpha}1) + ln({alpha}2)]/2 (Fig. 8c, Fig. 8d, Fig. 8e, and Fig. 8f). This means that even if a site belongs to the same rate category throughout the phylogeny, its relative rate will change among lineages depending on the difference in site rate distributions. Once more, we intuitively expected to reproduce systematic error that would join lineages with convergent rate among site distributions {alpha}1!= {alpha}2. Topology and branch lengths parameters are set up as in the good case including weak phylogenetic signal, so that we can see support oscillations in both directions (pro and against the wrong rate distribution convergent topology). We also simulated different shape parameters under different conservation levels between 1 and 0 in steps of 0.25 but maintained them equal among all subtrees (Fig. 8a) so that:


Formula 11

(11)


Figure 8
View larger version (274K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 8 Plot of the mean RELL support value for the wrong topology ([a1, b1], a2, b2) with different rate-across-site sampling distributions. (a) The simulated process: initially, rate category conservation is full along the inner branch ({theta}ab = 1) and equals {theta}12 for all branches leading to subtrees, so that the pairwise conservation coefficient is equal for all pairs of groups. RAS distribution shape parameters change across subtrees, {alpha}1 for a1 and b1, and {alpha}2 for a2 and b2. We assigned two different values to the shape parameter along the inner branch: we took one of the group values {alpha}ab = {alpha}1 (b) or alternatively used an interpolated value {alpha}ab = exp [ln ({alpha}1) + ln ({alpha}2)]/2 (c, d, e, and f). We simulated 100 repeats at shape parameter values {alpha}1 and {alpha}2 isin {exp (x) | x isin {–5, –4,...,5}}. (b and c) The effect of different combinations of shape parameters in the mean RELL support value where all {theta}s are set to 1. (d, e, and f) Changes in RELL support surface in (c) as we decreased the coefficient {theta}12 to 0.75, 0.5, and 0.25, respectively. Bias fades away as more sites are given different random rate categories per subtree. Convergent RAS distributions have a weakening effect in this case.

 
The outcome is quite surprising. When {theta}s are high (close to 1), for the most part, different shape parameter values drag together those subtree pairs with equal rate distribution shapes (Fig. 8b). Nonetheless, greater difference in shape parameters does not necessarily produce greater attraction between convergent subtrees. Moreover, there are zones where support falls below the level achieved when there are no differences between rate distributions ({alpha}1 = {alpha}2). Consequently, repulsion takes place between lineages with the same shape parameter. In this particular simulation batch, this phenomenon occurred when medium/high heterogeneity (around {alpha} = 1) and high heterogeneity (when {alpha} approaches zero) are combined. In fact, these sinuous support surface changes as the relative subtree heights vary to break the initial clock like overall structure (Fig. 9) or when the inner branch shape parameter is set up differently (Fig. 8b versus 8c).


Figure 9
View larger version (67K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 9 Effect of subtree height differences in the RELL support surface for the wrong topology ([a1, b1], a2, b2). (a) Tree with the wrong topology. (b) RELL support surface where the mean support RELL value is plotted against the natural logarithm of the shape parameter for the substitution rates in long branches ({alpha}1) and short branches ({alpha}2). We used the same parameterization as for Figure 8c but changed subtree heights to 2 for clades a1 and b1, and 0.5 changes for clades a2 and b2.

 
This may be due to a delicate equilibrium between counterpoised attraction and repulsion effects between all lineages because a single shape parameter is used to maximize the global likelihood, despite that each combination of two subtrees would have a different marginally best-fitting pairwise shape parameter. Susko et al. (2004) concluded that the bias strength and the inconsistency zone would depend on the relative branch lengths.

Nonetheless, one may question the biochemical relevance of such a particular scenario; changes on the overall distribution of site rates under the RAS model or its tuning parameters must be accompanied by arbitrary changes in site rate categories, resulting in decreasing {theta}s. We simulated the same shape parameter subspace at different overall rate category conservation coefficients (Fig. 8c, Fig. 8d, Fig. 8e, and Fig. 8f). The outcome revealed that repulsion or attraction between subtrees caused by convergence in the sampling distribution of rates between lineages strongly depends on the conservation of site relative rates. This puts forward the conclusion that site rate category dependency between lineages is required for attraction or repulsion over possible influences of relative rate probability distribution differences or similarities. Therefore, different RAS distributions cannot be used alone as an argument to postulate possible systematic bias in a data set or to indicate what lineages should attract.

Visualization and Test
Using the visualization method described in Methods, we plotted data sets simulated under different model conditions including the first and second parameter subspaces previously investigated. A regular RAS process ({theta} = 1 for all branches) generates a balanced density surface (Fig. 10a) around the center of the triangle, although segments from the center to the vertices do accumulate a slightly greater amount of points. Covarion unbiased data sets ({theta} values are 0 for all the subtree-leading branches and {theta} = 1 for the innermost branch) produce a visibly different point distribution with similar amount of points cumulating towards each hypothesis vertex (Fig. 10b). In contrast, covarion-biased data sets generated by correlating the conservation coefficients between pairs of subtrees yield unbalanced plots (Fig. 10c, Fig. 10d and Fig. 10f). Plot differences between biased and unbiased alignments are evident to the naked eye. The overall plot distribution may change if subtrees have clearly distinct shapes (number of taxa, branch length, and topology) as illustrated in Figure 10e and Figure 10f. Nonetheless, even in these circumstances, the averaged mean point position remains close to the center of the triangle in genuine RAS data sets.


Figure 10
View larger version (363K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 10 Example of bias visualization plots based on the layout described in Figure 3. These are the product of 10000 amino acid long alignments after smoothing the resulting scattered site plot. Darker areas indicate a greater concentration of site points. The circle indicates the center of the triangle, whereas the x points to the mass center of the cloud. (a) Plot resulting from a RAS substitution process. In this figure we used the parameterization of "the bad" case but fixing all {theta}s values to 1. (b) "Unbiased" covarion process: as in (a) but setting all conservation coefficients {theta} to zero so that rates are selected at random per each lineage. (c) A covarion evolving data set in which non-sister lineages attract: "the bad" case where {theta}1 = 1 and {theta}2 = 0. (d) Covarion process in which actual sister lineages attract: "the good" case where {theta}ab = 1 and {theta}12 = 0. (e) Represents an RAS process with a combination of different number of taxa and subtree heights. Subtrees a1 and b1 are 2 substitutions tall and have 8 taxa each, whereas the height of subtrees a2 and b2 is set to 0.25 substitutions and contain just 2 taxa each. (f) "The bad" case applied to (e) setting where {theta}1 = 1 and {theta}2 = 0. The distribution of dots clearly changes if subtrees are asymmetric. Biased data sets differ in that points migrate to the favored hypothesis vertex. Consequently, the sample mass center clearly breaks off from the center of the triangle (c, d, and f).

 
These features match the resulting real or artifactual tree support obtained by ML tree reconstruction. Accordingly, application of the proposed test on these simulated data sets allows the detection of deviations with increasing sensitivity as the bias strength grows (Fig. 11). Sensitivity here is measured as the percentage of positives with a significance level of 0.05 in 1000 amino acid long alignments.


Figure 11
View larger version (54K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 11 Sensitivity (solid line) of the proposed test at different values for model parameters that cause bias, plotted against the average RELL value of the favored hypothesis (dashed line). (a) "The bad" case, distantly related lineages attract as difference between conservation coefficient | {theta}1{theta}2| increases from 0 to 1 (Fig. 4). {theta}1 Takes values from 0.5 to 1 in steps of 0.05. {theta}2 is set to 1 –{theta}1. (b) On "the good" case when closely related lineages attract, the inner branch conservation coefficient {theta}ab varies from 0 to 1 is steps of 0.1. Other coefficients on branches leading to the four subtrees are set to 1 (Fig. 7). (c) Again on the second simulation batch, but in this case as lineage conservation coefficients vary from 0 to 1 and the inner branch coefficient is set to 1. (d) In "the ugly" case, setting {alpha}ab = {alpha}1 = 1, all {theta}s = 1 and giving {alpha}2 values in {exp (x) | x isin (–2, –1.5, –1, ..., 3)}. Notice that in this particular case the bias is pro the wrong topology in when {alpha}1 < {alpha}2[ln ({alpha}2) > 0] and pro the right topology when {alpha}1 > {alpha}2. In all cases we repeated the experiment 100 times.

 
Regarding "the ugly" case, the subtree pairwise rate difference ranking described above also detects parametric areas where systematic error occurs (Fig. 11d). Nonetheless, it seems not to be able to discern between attraction and repulsion, supporting the hypothesis that joins subtrees with similar site rate distributions. The fact that the parametric area of attraction and repulsion changes as different modeling elements are altered suggests that this will not be easily resolvable with an ad hoc approach. Here, the use of the right or fairly well approximated model of evolution is required. This issue will require further investigation; for the moment, we recommend to perform simulations when there is an important difference between shape distributions in different regions of the phylogeny.


    Remarks
 Top
 Notes
 Abstract
 Methods
 Results and Discussion
 Remarks
 Appendix
 Acknowledgments
 REFERENCES
 
It is important to find solutions to the limitations of current phylogenetic methods and evolutionary models. Incorrect phylogenetic resolution may invalidate any result obtained by downstream analyses. A first step towards this aim is to characterize peculiarities of evolutionary processes that may cause bias and provide assessing tools to localize their occurrence in real data sets.

In this work, we illustrated possible misleading effects of some violations of RAS model assumptions regarding evolutionary rates among sites and lineages. Here we extend the observation done by previous work on this aspect of molecular evolution (Chang, 1996; Kolaczkowski and Thornton, 2004), with a more general and relaxed heterogeneous model of evolution. Bias is caused by different degrees of site rate drift among lineages rather than specially tailored conflictive rate convergence. This phenomenon may also be reproducible when assuming a process accounting for continuous changes in site rates categories (Tuffley and Steel, 1998; Galtier, 2001) that allows for different rate-of-rates among lineages. We investigated different overlapping effects of some model parameters on the intensity of the systematic error caused by covarion-like evolution. It is important to note the importance that taxa sampling may have in reducing the resulting systematic error. Also, we tentatively proposed a tool to visualize and test for the presence of bias on four well-supported subtree phylogenies based on simple geometric concepts. In other words, we test the suitability of using RAS models for resolving the deepest branches of the phylogenetic tree. Reliability is especially dubious at this depth, as any posterior fixed changes (rest of the tree height) constitute noise regarding the resolution of these relationships. The closest procedure to our approach has been to check, normally a posteriori, the changes on the support values after selective resampling of data (Lockhart et al., 1998; Inagaki et al., 2004; Pisani, 2004; Brinkmann et al., 2005).

Although here we have focused only on quartet analysis, it may be extendable to a greater number of monophyletic groups by means of quartet puzzling–like techniques (Strimmer and von Haeseler, 1996) or porting the geometrical framework described herein to a higher dimensional space. In order to estimate discrepancies in rates among lineages, we needed to have a reasonable number of taxa per each group. Therefore this method cannot be used accurately with poorly taxed groups. Unfortunately, this is not an uncommon situation especially in studies performed on poorly sequenced taxonomy groups. Perhaps this may be avoided using the other dimension of the data, increasing the number of sites compared each time using a sliding window.

In any case, it is important to bear in mind that there is more than a single source of systematic error; for example, compositional heterogeneity (Mooers and Holmes, 2000; Foster, 2004). Moreover, bias may reinforce the right topology, as close phylogenetic relatives should intuitively tend to share traits that may not be accounted for by mainstream models. Therefore, thoughtful phylogeny curation is necessary in order to accurately assess the cause and effect of such model singularities.

We may devise some correcting protocols and alternative methods inspired by the simulation model and test proposed. Certainly, in molecular data sets, the accuracy of ML and Bayesian analyses may increase by including more sets of parameters in existing models such as the conservation coefficient per branch or subtree and by accounting for within-site rate changes. In this context, one could use likelihood-ratio tests or alternative model sorting criteria to determine whether the enhanced model explains the data significantly better and whether the simplified model is susceptible to bias. However, that may unbearably increase the computational cost and risk of overfitting. Model-based distance methods such as neighbor-joining could benefit from the fact that each pair distance need not to be estimated using the same global model. A plausible alternative is then to generate a mixed distance matrix, where many intersubtree or intertaxa models approximate better the evolutionary path between pairs of monophyletic groups. Another possibility is to perform a priori site-weighted bootstrapping or removal in order to improve data fitness for the model of choice. Although this last option may not seem to be the most attractive, as we may need to discard a sizable portion of the data, it is in fact constantly being done; e.g., leaving out the third codon position in coding DNA analyses.

In summary, diagnostic analysis approaches, like the one proposed in this work, may complement existing methods such as the maximum-likelihood and Bayesian ones and provide a visual statistical framework to take into account hidden parameters (covarion evolution, compositional bias, an so forth) that, when ignored, may hamper accurate phylogenetic inference.


    Appendix
 Top
 Notes
 Abstract
 Methods
 Results and Discussion
 Remarks
 Appendix
 Acknowledgments
 REFERENCES
 
Test Equation Demonstration
Under an RAS model of evolution and considering subtree with similar tree shapes, the bias plot proposed here must look the same (present the same density of points) from any hypothesis (tree topology) perspective. Let us define the pair (XH,YH) as the mean point location of the plot when we set the origin of coordinates on the triangle center and when the y-axis is set to pass through the hypothesis H vertex based on the layout depicted in Figure 3. Then based on the symmetry assumption, variance should be equal for all three coordinate systems:


Formula 12

(12)

It is easy to transform a point (X, Y) based on a coordinate system to another that is a simple rotation of the former with angle {lambda}:


Formula 13

(13)

Knowing that:


Formula 14

(14)
and using the transformation Equation (2), the variance for hypotheses H0 or H2 ({lambda} is 2 {pi}/3 and –2{pi}/3, respectively) can be expressed as a function of the variance for hypothesis H1 as follows:


Formula 15

(15)

Let us summarize Equation (4) using the following notation:


Formula 16

(16)

As sin({lambda}) = –sin (–{lambda}) and cos ({lambda}) = cos (–{lambda}), it follows that:


Formula 17

(17)
Consequently, the covariance term, C, must be null for all three hypotheses:


Formula 18

(18)

This last equality is satisfied if and only if {lambda} is multiple of {pi}/2 (not the case) or when the covariance is zero. Taking this into account we can easily prove that the variance of YH1 must be the same as that of XH1 to satisfy the symmetry condition:


Formula 19

(19)
This equation only holds if:


Formula 20

(20)
because cos 2({lambda}) + sin 2({lambda}) = 1.

Here we consider the fact that for any angle {lambda} both coordinates have equal variances and their covariances are equal to zero. Assuming that the site position obtained constitutes an independent sample, the central limit theorem (CLT) ensures that the distribution of the means approaches a Gaussian and, because it has the same variance from every angle x-axis or y-axis, it must approximate a circular 2-D bell centered on the triangle barycenter where the standard error is {sigma}x/{surd}N, N being the sample size. Nonetheless, because site point coordinates are calculated based on parameter estimators obtained using the same site sample, it is not certain that the independence condition for CLT is actually met with a limited amount of data. Consequently, we need to confirm any conclusion below through simulations.

We can determine the formula for the cumulate probability of the distance to the center taking as starting point the bivariate normal distribution or the chi-square. Here we develop the former. As the covariation is null, the mean is zero and the deviation is the same from every axis:


Formula 21

(21)

For further simplification, we integrate the probability volume along the axis y = 0, within certain distance x using the circle perimeter equation as follows:


Formula 22

(22)

We confirmed these results performing simulations and testing the goodness-of-fit of the P-value and mean point distributions using Kolmogorov-Smirnov test against a uniform and bivariate normal distribution, respectively. The simulations show that this constitutes a good approximation even with asymmetric data sets (different subtree heights and number of taxa) up to 10,000 sites in length. Here the distance must be taken from the mean point centre estimated through parametric bootstrapping. However, using general samples, mean or median comparison tests like Wilcoxon test are more recommendable in these cases.


    Acknowledgments
 Top
 Notes
 Abstract
 Methods
 Results and Discussion
 Remarks
 Appendix
 Acknowledgments
 REFERENCES
 
We would like to thank Davide Pisani for his valuable comments on the manuscript. We are also thankful to Professor Dan Bradley and Simon A. Travers for revising the grammar and spelling of the manuscript. Also we would like to express our gratitude to the editor and to both anonymous reviewers whose comments and suggestions were vital for the development of the final version of the manuscript. Computational resources were provided by Science Foundation Ireland (SFI) under the basic research grant program awarded to M.A.F. We would also like to express our gratitude to the computational resources hosted by the Bioinformatics Laboratory and Computer Science Department at the National University of Ireland, Maynooth.


    Notes
 Top
 Notes
 Abstract
 Methods
 Results and Discussion
 Remarks
 Appendix
 Acknowledgments
 REFERENCES
 
2 Present Address: Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin 2, Ireland; E-mail: faresm{at}tcd.ie (M.A.F.) Back


    REFERENCES
 Top
 Notes
 Abstract
 Methods
 Results and Discussion
 Remarks
 Appendix
 Acknowledgments
 REFERENCES
 

    Ane C., Burleigh J. G., MaMahon M. M., Sanderson M. J. Covarion structure in plastid genome evolution: A new statistical test. Mol. Biol. Evol. (2005) 22:914–924.[Abstract/Free Full Text]

    Bremer B., Jansen R. K., Oxelman B., Backlund M., Lantz H., Kim K. J. More characters or more taxa for a robust phylogeny—Case study from the coffee family (Rubiaceae). Syst. Biol. (1999) 48:413–435.[Abstract/Free Full Text]

    Brinkmann H., Giezen M., Zhou Y., Raucourt G. P., Philippe H. An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst. Biol. (2005) 54:743–757.[Abstract/Free Full Text]

    Brinkmann H., Philippe H. Archaea sister group of Bacteria? Indications from tree reconstruction artefacts in ancient phylogenies. Mol. Biol. Evol. (1999) 16:817–825.[Abstract]

    Chang J. T. Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. (1996) 134:191–223.

    Drummond A., Strimmer K. PAL: An object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics (2001) 17:662–663.[Abstract/Free Full Text]

    Fitch W. M., Markowitz E. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. (1970) 4:579–593.[CrossRef][Web of Science][Medline]

    Foster P. G. Modelling compositional heterogeneity. Syst. Biol. (2004) 53:485–495.[Abstract/Free Full Text]

    Gadagkar S. R., Kumar S. Maximum-likelihood outperforms maximum parsimony even when evolutionary rates are heterotachous. Mol. Biol. Evol. (2005) 22:2139–2141.[Abstract/Free Full Text]

    Galtier N. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. (2001) 18:866–873.[Abstract/Free Full Text]

    Gaucher E. A., Miyamoto M. M. A call for likelihood phylogenetics even when the process of sequence evolution is heterogeneous. Mol. Phylogenet. Evol. (2005) 37:928–931.[CrossRef][Web of Science][Medline]

    Gu X. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. (1999) 16:1664–1674.[Abstract]

    Gu X. Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol. (2001) 18:453–464.[Abstract/Free Full Text]

    Huelsenbeck J. P. Systematic bias in phylogenetic analysis: Is the Strepsiptera problem solved? Syst. Biol. (1998) 47:519–537.[Web of Science][Medline]

    Huelsenbeck J. P. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. (2002) 19:689–707.[Abstract/Free Full Text]

    Inagaki Y., Susko E., Fast N., Roger A. Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1alpha phylogenies. Mol. Biol. Evol. (2004) 21:1340–1349.[Abstract/Free Full Text]

    Jones D. T., Taylor W. R., Thornton J. M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. (1992) 8:275–282.[Abstract/Free Full Text]

    Juke T. H., Cantor C. R. Evolution of protein molecule. In: Mammalian protein metabolism—Munro H. N., ed. (1969) New York: Academic Press. 21–132.

    Kim J. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing number of taxa. Syst. Biol. (1996) 45:363–374.[Abstract/Free Full Text]

    Kim J. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. (1998) 47:43–60.[Abstract/Free Full Text]

    Kim J. Slicing hyperdimensional oranges: The geometry of phylogenetic estimation. Mol. Phylogenet. Evol. (2000) 17:58–75.[CrossRef][Web of Science][Medline]

    Kishino H., Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J. Mol. Evol. (1989) 29:170–179.[CrossRef][Web of Science][Medline]

    Knudsen B., Miyamoto M. M. A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc. Natl. Acad. Sci. USA (2001) 98:14512–14517.[Abstract/Free Full Text]

    Kolaczkowski B., Thornton J. W. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature (2004) 431:980–984.[CrossRef][Medline]

    Lockhart P. J., Steel M. A., Barbrook A. C., Huson D. H., Charleston M. A., Howe C. J. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. (1998) 15:1183–1188.[Abstract]

    Lopez P., Casane D., Philippe H. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. (2002) 19:1–7.[Abstract/Free Full Text]

    Lopez P., Forterre P., Philippe H. The root of the tree of life in the light of the covarion model. J. Mol. Evol. (1999) 49:496–508.[CrossRef][Web of Science][Medline]

    Mayrose I., Graur D., Ben-Tal N., Pupko T. Comparison of site-specific rateinference methods for protein sequences: Empirical Bayesian methods are superior. Mol. Biol. Evol. (2004) 21:1781–1791.[Abstract/Free Full Text]

    Mooers A. O., Holmes E. C. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. (2000) 15:365–369.[CrossRef][Medline]

    Penny D., McComish B. J., Charleston M. A., Hendy M. D. Mathematical elegance with biochemical realism: Covarion model of molecular evolution. J. Mol. Evol. (2001) 53:711–723.[CrossRef][Web of Science][Medline]

    Philippe H., Zhou Y., Brinkmann H., Rodrigue N., Delsuc F. Heterotachy and long-branch attraction in phylogenetics. BMC Evol. Biol. (2005) 5:50.[CrossRef][Medline]

    Pisani D. Identifying and removing fast-evolving sites using compatibility analysis: An example from the Arthropoda. Syst. Biol. (2004) 53:978–989.[Free Full Text]

    Pupko T., Pe'er I., Shamir R., Graur D. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol. Biol. Evol. (2000) 17:890–896.[Abstract/Free Full Text]

    Siddall M. E. Success of parsimony in the four-taxon case: Long-branch repulsion by likelihood in the Farris zone. Cladistics (1998) 14:209–302.[CrossRef][Web of Science]

    Spencer M., Susko E., Roger A. J. Likelihood, parsimony, and heterogeneous evolution. Mol. Biol. Evol. (2005) 22:1161–1164.[Abstract/Free Full Text]

    Strimmer K., von Haeseler A. Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. (1996) 13:964–969.[Web of Science]

    Strimmer K., von Haeseler A. Likelihood-mapping: A simple method to visualize phylogenetic content of a sequence alignment. Proc. Natl. Acad. Sci. USA (1997) 94(13):6815–6819.[Abstract/Free Full Text]

    Sullivan J., Swofford D. L. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. (2001) 50:723–729.

    Susko E., Inagaki Y., Field C., Holder M. E., Roger A. J. Testing for differences in rate-across-site distributions in phylogenetic subtrees. Mol. Biol. Evol. (2002) 19:1514–1523.[Abstract/Free Full Text]

    Susko E., Inagaki Y., Roger A. J. On inconsistency of the neighbor-joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled. Mol. Biol. Evol. (2004) 21:1629–1642.[Abstract/Free Full Text]

    Susko E., Spencer M., Roger A. J. Biases in phylogenetic estimation can be caused by random sequence segments. J. Mol. Evol. (2005) 61:351–359.[CrossRef][Web of Science][Medline]

    Swofford D. L., Waddell P. J., Huelsenbeck J. P., Foster P. G., Lewis P. O., Rogers J. S. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. (2001) 50:525–539.[Free Full Text]

    Tuffley C., Steel M. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. (1998) 147:63–91.[CrossRef][Web of Science][Medline]

    Uzzell T., Corbin K. W. Fitting discrete probability distributions to evolutionary events. Science (1971) 172:1089–1096.[Abstract/Free Full Text]

    Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. (1997) 13:555–556.[Free Full Text]

    Zhang J., Nei M. Accuracies of ancestral amino acid sequences inferred by the parsimony likelihood and distance methods. J. Mol. Evol. (1996) 44:S139–A146.[CrossRef][Web of Science]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
N. C. Sheffield, H. Song, S. L. Cameron, and M. F. Whiting
Nonstationary Evolution and Compositional Heterogeneity in Beetle Mitochondrial Phylogenomics
Syst Biol, August 1, 2009; 58(4): 381 - 394.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
F. A. Matsen and M. Steel
Phylogenetic Mixtures on a Single Tree Can Mimic a Tree of Another Topology
Syst Biol, October 1, 2007; 56(5): 767 - 775.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ruano-Rubio, V.
Right arrow Articles by Fares, M. A.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Ruano-Rubio, V.
Right arrow Articles by Fares, M. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?