
Systematic Biology 2005 54(6):900-915; doi:10.1080/10635150500354829

© 2005 Society of Systematic Biologists

Calculating the Evolutionary Rates of Different Genes: A Fast, Accurate Estimator with Applications to Maximum Likelihood Phylogenetic Analysis

Edited by Tim Collins

Rachel B. Bevan1, B. Franz Lang2 and David Bryant1

1 McGill Centre for Bioinformatics Duff Medical Building, 3775 University Street, Montréal, Quebec, H3A 2B4, Canada; E-mail: rachel{at}mcb.mcgill.ca. (R.B.B.)
2 Program in Evolutionary Biology, Canadian Institute for Advanced Research; Centre Robert Cedergren, Département de Biochimie, Université de Montréal 2900 Boulevard Edouard-Montpetit, Montréal, Québec, H3T 1J4, Canada


    Abstract
In phylogenetic analyses with combined multigene or multiprotein data sets, accounting for differing evolutionary dynamics at different loci is essential for accurate tree prediction. Existing maximum likelihood (ML) and Bayesian approaches are computationally intensive. We present an alternative approach that is orders of magnitude faster. The method, Distance Rates (DistR), estimates rates based upon distances derived from gene/protein sequence data. Simulation studies indicate that this technique is accurate compared with other methods and robust to missing sequence data. The DistR method was applied to a fungal mitochondrial data set, and the rate estimates compared well to those obtained using existing ML and Bayesian approaches. Inclusion of the protein rates estimated from the DistR method into the ML calculation of trees as a branch length multiplier resulted in a significantly improved fit as measured by the Akaike Information Criterion (AIC). Furthermore, bootstrap support for the ML topology was significantly greater when protein rates were used, and some evident errors in the concatenated ML tree topology (i.e., without protein rates) were corrected.

Keywords: Bayesian credible intervals; DistR method; multigene phylogeny; PHYML; rate heterogeneity

Received November 24, 2004; Revised March 18, 2005; Accepted May 24, 2005


It is widely recognized that the analysis of multiple unlinked genes is superior to single gene analyses for phylogenetic reconstruction. These unlinked genes may, however, be evolving according to very different rules. Heterogeneity of the evolutionary process must be accounted for in phylogenetic analyses (Bapteste et al., 2002; Bull et al., 1993; Huelsenbeck et al., 1996; Nylander et al., 2004; Pupko et al., 2002b; Yang, 1996). The concept of accounting for differing evolutionary pressures within phylogenetic analysis is not new (Yang, 1993). Site-specific rates of evolution can be computed for amino acids (e.g., Rate4Site, Mayrose et al., 2004; Pupko et al., 2002a) and DNA (e.g., DNArates, Olsen et al., 1993) using both Bayesian and maximum likelihood approaches.

Site rates within a gene are likely to be more correlated than rates for sites in different genes. To account for this, it can be assumed that each gene evolves at a different average rate and that these gene rates are drawn from some common distribution (Cranston and Rannala, 2005; Felsenstein, 2001, 2004a). Both Bayesian (Huelsenbeck and Ronquist, 2001) and maximum likelihood (Pupko et al., 2002b; Yang, 1996) methods exist to estimate gene rates (or more generally, locus rates), but these are computationally expensive.

We present a fast, accurate method to estimate the relative evolutionary rates of genes/proteins. For example, when run on a data set with 63 proteins over 123 taxa, the algorithm takes less than a second. The method can be applied to protein or nucleotide data, though here we focus on protein sequences. The basic idea is to use pairwise estimates of evolutionary divergence (distances) to deduce the relative rates of different proteins, even when the proteins are not all present in all of the taxa. Although this approach does not give the ML estimates for the rates (Pupko et al., 2002b; Yang, 1996), it does provide an excellent approximation.

Once computed, the rates are incorporated as extra parameters into the ML tree search, resulting in improved fit as measured by the AIC. The use of rates estimated with the DistR procedure has been coded into PHYML version 2.2, available at http://atgc.lirmm.fr/phyml/ (Guindon and Gascuel, 2003). PHYML was used because incorporation of the rates was straightforward and because PHYML is an especially fast implementation of ML.
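The AIC comparison mentioned above can be illustrated with a short calculation. The sketch below uses the standard definition AIC = 2k − 2 ln L. The log-likelihoods and parameter counts are made-up placeholders, not values from this study, and counting one extra parameter per relative rate (n − 1 for n proteins) is an assumption about how the rates are tallied.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion; lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical numbers for illustration only (not results from the paper).
n_proteins = 15
base_params = 60                 # assumed: branch lengths plus model parameters
lnl_concatenated = -50000.0      # assumed log-likelihood without protein rates
lnl_with_rates = -49800.0        # assumed log-likelihood with protein rates

aic_concatenated = aic(lnl_concatenated, base_params)
aic_with_rates = aic(lnl_with_rates, base_params + n_proteins - 1)

# A negative difference favours the model that includes protein rates.
delta = aic_with_rates - aic_concatenated
```

With these placeholder values the gain in log-likelihood outweighs the penalty for the additional rate parameters, so the difference is negative.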


    Methods
The DistR Method
To begin with, the method will be explained through an example. Figure 1 represents three different protein alignments. Not all taxa are present in all three alignments. Suppose that the three proteins have rates r1, r2, and r3. These rates will affect distances inferred from the alignments. Reversing the problem involves using the pairwise distances between species to estimate the different rates r1, r2, and r3.


Figure 1 The general idea of the DistR estimation procedure. Beginning with individual protein alignments over a set of taxa (with missing data), distances between the species are estimated for each protein alignment. There are two ways to estimate the distances: directly from the alignment data (method 2), or as the length of the path between taxa on a tree built from the alignment data (method 1). The result is a matrix of pairwise distances between taxa. The ratio of the pairwise distances to the rate of evolution of the protein should be approximately the same for all proteins.

 
Figure 1 outlines two ways of obtaining distances from each protein. In the first method ML trees are constructed and the length of the path between two taxa in these trees is measured (referred to hereafter as patristic ML distances). In the second method distances are estimated directly from the alignments, as is customary in distance-based methods (referred to hereafter as pairwise ML distances). The end result from both methods is a distance matrix for each protein.
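A patristic distance is simply the sum of branch lengths along the path joining two taxa in a tree. The following minimal sketch makes that concrete; the tree, its branch lengths, and the function name `patristic_distance` are illustrative assumptions and are not taken from PHYML or DistR.

```python
def patristic_distance(tree, start, goal):
    """Sum of branch lengths on the unique path between two nodes.

    `tree` maps each node to a dict of {neighbour: branch_length}.
    """
    stack = [(start, None, 0.0)]          # (node, parent, distance so far)
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, length in tree[node].items():
            if neighbour != parent:       # never walk back along the same branch
                stack.append((neighbour, node, dist + length))
    raise ValueError(f"{goal!r} is not reachable from {start!r}")


# Hypothetical unrooted four-taxon tree ((A,B),(C,D)) with internal nodes u, v.
edges = [("A", "u", 0.10), ("B", "u", 0.20), ("u", "v", 0.05),
         ("C", "v", 0.30), ("D", "v", 0.15)]
tree = {}
for a, b, length in edges:
    tree.setdefault(a, {})[b] = length
    tree.setdefault(b, {})[a] = length

d_ac = patristic_distance(tree, "A", "C")   # path A-u-v-C
```

Repeating the call for every pair of taxa in a protein's tree yields that protein's patristic distance matrix.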

If the rate in one protein is twice the rate in a second protein, then the expected distance estimates from the first protein should be twice the expected distance estimates from the second protein. This should hold, approximately, for both pairwise ML distances and patristic ML distances. Equivalently, the distance estimate from the first protein, divided by two, should be approximately the distance estimate of the second protein.
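The proportionality described here can be checked on a toy example. The sketch below is a naive two-protein comparison restricted to the taxon pairs that both proteins share; it is not the DistR estimator itself, and all distances are invented for illustration.

```python
def relative_rate(dist_a, dist_b):
    """Rate of protein a relative to protein b, from taxon pairs present in both."""
    shared = dist_a.keys() & dist_b.keys()
    if not shared:
        raise ValueError("the two proteins share no taxon pairs")
    return sum(dist_a[p] for p in shared) / sum(dist_b[p] for p in shared)


# Hypothetical distances: protein 1 evolves twice as fast as protein 2,
# and taxon D is missing from protein 2.
protein1 = {("A", "B"): 0.20, ("A", "C"): 0.60, ("B", "C"): 0.50, ("A", "D"): 0.80}
protein2 = {("A", "B"): 0.10, ("A", "C"): 0.30, ("B", "C"): 0.25}

ratio = relative_rate(protein1, protein2)
```

Only the three shared pairs enter the ratio, so the missing taxon does not distort the comparison.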

In the example (Fig. 1), and later on, the distance between taxa x and y estimated from protein k is denoted $d^{k}_{xy}$, irrespective of whether it is a pairwise or patristic ML distance. Suppose that, for each k, the rate in protein k equals $r_k$. It follows that $d^{1}_{xy}/r_1$ will be approximately equal to $d^{2}_{xy}/r_2$, which in turn will be approximately equal to $d^{3}_{xy}/r_3$. This is denoted as

$$\frac{d^{1}_{xy}}{r_1} \approx \frac{d^{2}_{xy}}{r_2} \approx \frac{d^{3}_{xy}}{r_3} \tag{1}$$

where "≈" means "approximately equal." In Figure 1, this gives Formula .

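In a sense, the distance estimates obtained from each gene are normalized so that the scale is the same. Define this normalized distance or consensus distance between any two taxa as $p_{xy}$, with the assumption that

$$p_{xy} \approx \frac{d^{1}_{xy}}{r_1} \approx \frac{d^{2}_{xy}}{r_2} \approx \frac{d^{3}_{xy}}{r_3}$$

where $d^{k}_{xy}$ is the distance between taxa x and y estimated from protein k. Assume that rates $r_1$, $r_2$, and $r_3$ in Figure 1 are unknown, whereas the distances remain known. The above approximate equality leads to

$$d^{k}_{xy} \approx r_k \, p_{xy}, \qquad k = 1, 2, 3 \tag{2}$$

The unknowns $p_{xy}$, $r_1$, $r_2$, and $r_3$ can be solved for using a least squares approach.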

The relation in Equation (2) provides a framework to solve for the relative rates $r_1$, $r_2$, and $r_3$, given estimates for the distances $d^{k}_{xy}$ between taxa x and y derived from each protein k. This is the basic idea behind the method. The main issues are how to (a) handle the fact that the relations are only approximate; (b) deal with missing distances; and (c) compute the rate estimates quickly. These issues are addressed in the following text and in Appendix 2.

To formalize the problem, suppose that there are n proteins (or genes, etc.) over m species. The distance between species x and y derived from protein k is denoted $d^{k}_{xy}$. The basic assumption made is that the ratio of the estimated distance between a pair of taxa for a given protein ($d^{k}_{xy}$ for protein k and taxa x, y) to the rate of the protein ($r_k$ for protein k) is approximately equal across all proteins.

The rates $r_1, r_2, \ldots, r_n$ are unknown quantities to be estimated based upon the distance data from a given protein alignment. To do this, assume that there exists an unknown consensus distance $p_{xy}$ such that

$$p_{xy} \approx \frac{d^{1}_{xy}}{r_1} \approx \frac{d^{2}_{xy}}{r_2} \approx \cdots \approx \frac{d^{n}_{xy}}{r_n}$$

where n = 3 for the example in Figure 1. All the consensus distances and rates can now be estimated using a least-squares approach.
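To make the least-squares idea concrete, here is a minimal sketch that alternates between updating the consensus distances and updating the rates. It is an illustrative stand-in, not the fast algorithm described in the appendices: it uses equal weights for all observed distances, treats missing pairs as absent, and fixes the arbitrary scale by setting the mean rate to one. The function name `estimate_rates` and the test data are assumptions made for this example.

```python
def estimate_rates(dists, n_iter=200):
    """Alternating least squares for d[k][pair] ~ r[k] * p[pair].

    `dists` holds one dict per protein, mapping a taxon pair to a distance;
    pairs missing from a protein are simply absent from its dict.
    """
    n = len(dists)
    pairs = sorted(set().union(*dists))
    inv_rate = [1.0] * n                      # 1 / r_k, starting from equal rates

    def consensus(inv):
        # p_xy: mean of d_xy^k / r_k over the proteins that contain the pair.
        p = {}
        for pair in pairs:
            scaled = [inv[k] * dists[k][pair] for k in range(n) if pair in dists[k]]
            p[pair] = sum(scaled) / len(scaled)
        return p

    for _ in range(n_iter):
        p = consensus(inv_rate)
        # Least-squares update of 1 / r_k given the current consensus distances.
        for k in range(n):
            num = sum(d * p[pair] for pair, d in dists[k].items())
            den = sum(d * d for d in dists[k].values())
            inv_rate[k] = num / den
        # Remove the arbitrary scale: rescale so that the mean rate is one.
        mean_rate = sum(1.0 / a for a in inv_rate) / n
        inv_rate = [a * mean_rate for a in inv_rate]

    return [1.0 / a for a in inv_rate], consensus(inv_rate)


# Hypothetical error-free data: three proteins, two of them missing one taxon.
true_p = {("A", "B"): 0.10, ("A", "C"): 0.30, ("A", "D"): 0.40,
          ("B", "C"): 0.25, ("B", "D"): 0.35, ("C", "D"): 0.20}
true_rates = [0.5, 1.0, 1.5]
missing = [None, "D", "A"]
dists = [{pair: r * d for pair, d in true_p.items() if absent not in pair}
         for r, absent in zip(true_rates, missing)]

rates, p = estimate_rates(dists)
```

On error-free input such as this, the iteration recovers the relative rates even though no single protein pair shares all of its distances.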

In the least squares method it is possible to incorporate measures of uncertainty about the estimated distances $d^{k}_{xy}$. Distance estimates with low variance should contribute more to the analysis, whereas distance estimates with high variance (or infinite variance in the case of missing entries) should contribute little. Let $w^{k}_{xy} \geq 0$ be a weight reflecting the certainty of the distance estimate between taxa x and y derived from protein k. If $d^{k}_{xy}$ is accurate, then $w^{k}_{xy}$ should be high. If there is less certainty about the accuracy of $d^{k}_{xy}$, then $w^{k}_{xy}$ should be low. This is achieved using the inverse of the variance of $d^{k}_{xy}$, that is, $w^{k}_{xy} = 1/\mathrm{var}(d^{k}_{xy})$. If protein k is not present in both x and y, then $w^{k}_{xy} = 0$. To measure the variance of the distance estimates the approximate formula of Bulmer (1991) is used in the implementation of DistR. Other variance estimators could also be used.

Under a weighted least-squares (WLS) framework the total discrepancy between the ratios $d^{k}_{xy}/r_k$ and the consensus distances $p_{xy}$ is measured by

$$q(\mathbf{p}, \mathbf{r}) = \sum_{k=1}^{n} \sum_{x<y} w^{k}_{xy} \left( \frac{d^{k}_{xy}}{r_k} - p_{xy} \right)^{2} \tag{3}$$

where p denotes the vector $[p_{12}, p_{13}, \ldots, p_{(m-1)m}]^{T}$ and r denotes the vector $[r_1, \ldots, r_n]^{T}$. This is similar to the minimization function used by Lapointe and Cucumel (1997) in the average consensus method. The main difference is that they assume one rate over all proteins, whereas this method includes different rates for each protein. Note that if taxon x or y is missing from protein k, then an estimate for $d^{k}_{xy}$ cannot be obtained. However, this is not a problem since the weight $w^{k}_{xy}$ will be zero in this case.
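The weighted discrepancy can be evaluated directly once distances, variances, rates, and consensus distances are in hand. The sketch below assumes inverse-variance weights as described in the text and gives missing pairs zero weight by skipping them; the function name `wls_discrepancy` and all numbers are hypothetical.

```python
def wls_discrepancy(dists, variances, rates, consensus):
    """Weighted sum of squared differences between d / r and the consensus distance."""
    total = 0.0
    for d_k, var_k, r_k in zip(dists, variances, rates):
        for pair, d in d_k.items():          # pairs absent from d_k have weight zero
            weight = 1.0 / var_k[pair]       # inverse variance of the distance estimate
            total += weight * (d / r_k - consensus[pair]) ** 2
    return total


# Hypothetical data: protein 0 fits perfectly, protein 1 is off by 0.02 on one pair.
consensus = {("A", "B"): 0.1, ("A", "C"): 0.3}
dists = [{("A", "B"): 0.2, ("A", "C"): 0.6}, {("A", "B"): 0.12}]
variances = [{("A", "B"): 0.001, ("A", "C"): 0.002}, {("A", "B"): 0.0004}]
rates = [2.0, 1.0]

q = wls_discrepancy(dists, variances, rates, consensus)
```

The only contribution comes from the misfitting pair, whose squared residual is scaled up by its small variance.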

Estimating both rates and consensus distances using q(p, r) leads to the problem of nonidentifiability. In the absence of any error each estimated protein distance $d^{k}_{xy}$ is the product of the rate of the protein $r_k$ and the consensus distance $p_{xy}$. Thus, a perfect fit to the equation is still achieved if all the rates are multiplied by some constant and all the consensus distances divided by the same constant. There is a problem of determining scale. Hence, Equation (3) does not have a well-defined minimum. To solve this problem a constraint


Formula 4

(4)
must be added to the system, where κ is an arbitrary positive constant. The particular value of κ is irrelevant since changing κ merely causes all estimated rates to be multiplied by the same constant value. For this reason, it is possible to infer relative rates only. In DistR Formula , thus constraining the weighted estimated distances to be equal to the weighted consensus distances. This was empirically determined to minimize the variance of the DistR estimates.
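The scale indeterminacy is easy to see numerically: multiplying every rate by a constant and dividing every consensus distance by the same constant leaves each product of rate and consensus distance unchanged. The sketch below demonstrates this and picks the scale at which the mean rate is one, which is the convention used when rates are reported in the tables and figures. It does not reproduce the particular constraint used inside DistR, and the values are hypothetical.

```python
def rescale(rates, consensus, factor):
    """Multiply every rate by `factor` and divide every consensus distance by it."""
    return ([r * factor for r in rates],
            {pair: p / factor for pair, p in consensus.items()})


def normalize_to_mean_one(rates, consensus):
    """Choose the scale at which the average rate equals one."""
    mean_rate = sum(rates) / len(rates)
    return rescale(rates, consensus, 1.0 / mean_rate)


rates = [1.0, 2.0, 3.0]                         # hypothetical values
consensus = {("A", "B"): 0.1, ("A", "C"): 0.3}  # hypothetical values
norm_rates, norm_consensus = normalize_to_mean_one(rates, consensus)
```

Both the original and the normalized solution predict the same protein distances, which is why only relative rates can be inferred.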

Appendix 3 describes an extremely fast algorithm for minimizing the function q(p, r) subject to the constraint in Equation (4). The algorithm takes $O(nm^2 + n^3)$ time and $O(n^2 + m^2)$ memory. For example, when run on a data set with 63 proteins over 123 taxa, the algorithm takes less than a second. An implementation with source code is available at http://www.mcb.mcgill.ca/~rachel.

Experimental Studies
An extremely rapid method for estimating the relative rates of different genes has been proposed. The method is orders of magnitude faster than existing ML and Bayesian approaches. The most important question remaining is to what extent this increase in speed affects the accuracy of the estimates. In order to address this question, the accuracy of the new method was assessed using both simulated and empirical data.

In all the analyses PHYML (version 2.2) was used (Guindon and Gascuel, 2003) to compute ML distances and trees, with a JTT protein model, eight gamma categories plus invariant sites and the default (BIONJ) starting tree. The gamma shape parameter and proportion of invariant sites were estimated using default optimization routines in the program. When constructing ML trees from real data several bootstrap values were computed. As detailed below, these values depend upon whether patristic or pairwise ML distances were used in the DistR procedure and whether the rates were reestimated for each bootstrap replicate.

For both the simulated and empirical data, DistR estimates based upon patristic and pairwise ML distances were compared. This comparison was made in order to determine whether or not the additional computational effort required for estimating patristic ML distances is justified.

Experimental Studies—Simulated Data
The two key questions addressed through the simulation studies are:

  • Patristic versus pairwise ML distances.—How accurate are the rate estimates using pairwise versus patristic ML distances?
  • Missing distances between taxa.—How are DistR rate estimates affected when proteins are not present in all taxa?

To answer these questions protein alignments were simulated using Pseq-Gen (Grassly et al., 1997) with the JTT model of evolution. The initial tree and branch lengths were taken from an independent analysis of mitochondrial Atp8 proteins in 58 eukaryotes. Two types of simulations were carried out. The first, intended to address the first question, involved construction of 20 protein trees by randomly deleting taxa from the starting tree. In total there were four protein trees with 53 taxa, four with 48 taxa, four with 43 taxa, four with 38 taxa, and four with 33 taxa. For each tree a rate was sampled from a precomputed distribution of rates based on real data (data not shown), and protein alignments of length 100, 300, 500, and 1000 were generated using Pseq-Gen (Grassly et al., 1997); note that the average length of naturally occurring proteins is approximately 300 amino acids. The second analysis, intended to address the second question, increased the number of taxa deleted from the starting tree. In total there were seven trees with approximately 25% of the taxa, seven with approximately 50% of the taxa, and seven with approximately 75% of the taxa. This resulted in 21 trees, 7 each with 16, 30, and 44 taxa, respectively. For each tree a rate was sampled from a precomputed distribution of rates based on real data (data not shown), and protein alignments of length 1000 were generated using Pseq-Gen (Grassly et al., 1997). This experiment follows a protocol proposed by Eulenstein et al. (2004). For both experiments, and for every set of parameters, 10 replicates of the experiment were performed. See Figure 2 for an overview of the simulations.
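The taxon-sampling design of the two simulation studies can be sketched as follows. Only the choice of taxon subsets is shown; pruning the starting tree and simulating sequences are outside the scope of this example. The taxon labels, the seed, and the helper name `sample_taxon_sets` are placeholders introduced for illustration.

```python
import random


def sample_taxon_sets(taxa, sizes, per_size, seed=0):
    """Draw `per_size` random taxon subsets for each requested subset size."""
    rng = random.Random(seed)
    return [sorted(rng.sample(taxa, size)) for size in sizes for _ in range(per_size)]


taxa = [f"taxon_{i:02d}" for i in range(1, 59)]      # 58 placeholder taxon names

# First study: 20 protein trees, four each with 53, 48, 43, 38, and 33 taxa.
first_study = sample_taxon_sets(taxa, sizes=[53, 48, 43, 38, 33], per_size=4)

# Second study: 21 protein trees, seven each with 16, 30, and 44 taxa.
second_study = sample_taxon_sets(taxa, sizes=[16, 30, 44], per_size=7)

# Fraction of the 58 taxa retained at each size in the second study.
fractions = [round(size / len(taxa), 2) for size in (16, 30, 44)]
```

The retained fractions come out slightly above the nominal 25%, 50%, and 75% because 58 is not divisible by four.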


Figure 2 The general flow of the simulation studies. Two studies were performed, one with n = 20 and the other with n = 21 (where n is the number of proteins). The first study compared different methods of estimating distances using different alignment lengths. In the first study, 20 random subtrees from an original tree of 58 species were created, four each of size m = 33, m = 38, m = 43, m = 48, and m = 53 (where m is the size of the taxon set for a given protein). For each tree, a rate was sampled from a precomputed distribution of rates based on real data (data not shown). Protein alignments of length 100, 300, 500, and 1000 were simulated using Pseq-Gen (Grassly et al., 1997). A second analysis compared rate estimates with increasing amounts of data. Twenty-one random subtrees from the original tree of 58 species were created, 7 each of size m = 16, m = 30, and m = 44 (corresponding to approximately 25%, 50%, and 75% of the species [as in Eulenstein et al., 2004]). For each tree, a rate was sampled from a precomputed distribution of rates based on real data (data not shown). Alignments of length 1000 were generated. For both studies, 10 replicates were performed for each set of parameters.

 
Statistics measured on the simulated data, including goodness-of-fit and mean squared error, are explained in detail in Appendix 1. These statistics were used to relate the accuracy of the DistR rate estimates to the known rates at which the proteins were simulated.

Experimental Studies—Empirical Data
The data analyzed in this study consist of a set of 15 aligned mitochondrial protein sequences from 29 taxa. The taxon names and accession numbers are given in Table 1. Protein names and alignment accession numbers appear in Table 2. This multiprotein data set is of moderate size, and variants thereof have been used in numerous publications (e.g., Bullerwell et al., 2003; Lang et al., 2002; Sumida et al., 2001; Tomita et al., 2002). Furthermore, some of the species have high evolutionary rates and substitutional saturation of sites (i.e., Smittium), whereas others have very short branches in the resulting phylogenetic tree. Combined, these two properties can cause inaccurate grouping of the taxa due to long-branch attraction artifacts (Felsenstein, 1978).


Table 1 Empirical data analyzed. Names and accession numbers for protein sequences studied from Fungal species and outgroup. Fifteen proteins were downloaded for each species (if present in the species), the names of which are in Table 2.

 


Table 2 DistR estimates for empirical data based on pairwise and patristic ML distance estimates. Mean rate estimates and variances for rate estimates are based upon bootstrap replicates over the fungal data set. Rates are normalized so that the average rate is one. Acc. no. = accession number for the alignment in EMBL. AL = alignment length. Patristic refers to rates estimated based on distances from maximum likelihood trees. Pairwise refers to rates estimated based on maximum likelihood distances.

 
Alignments were performed using the default settings of ClustalW (Thompson et al., 1994). Highly variable sites or those with many gaps were eliminated using Gblocks (Castresana, 2000) with the following settings: number of sequences for a flank position equal to half the number of species plus one; number of contiguous nonconserved positions equal to 10; minimum length of a block four; half the species allowed gaps. All other parameters were set to default.

The key questions addressed using real protein data are:

  • Comparison of DistR estimates to ML estimates.—How do DistR rate estimates compare to those obtained using the ML-based method COMBINE (Pupko et al., 2002b)?
  • Comparison of DistR estimates to Bayesian estimates.—How do DistR rate estimates compare to those obtained by MrBayes (Huelsenbeck and Ronquist, 2001) under a Bayesian approach?
  • Patristic versus pairwise ML distances.—How do rate estimates from pairwise ML distances and rate estimates from patristic ML distances compare when applied to real data?
  • Inclusion of DistR estimates into the phylogenetic tree search of PHYML.—What is the effect of including DistR estimates in an ML tree search? Is there a significantly improved fit? Are improved phylogenetic estimates obtained?

Comparison of DistR estimates to ML estimates
Note that when comparing DistR rates to those computed using COMBINE (Pupko et al., 2002b), the number of taxa and proteins had to be restricted, because COMBINE can currently only handle data sets for which all taxa are present in all proteins. Two different starting trees were included in the analysis: the ML tree from PHYML based upon the concatenated data set and the ML tree from PHYML when protein rates were incorporated. Rates were estimated under three different models: global amino acid frequencies with one gamma distribution; local amino acid frequencies (for each protein partition) with one gamma distribution; local amino acid frequencies with one gamma distribution for each partition.

Comparison of DistR estimates to Bayesian estimates
Bayesian estimation of the posterior distribution of the protein rates was performed using MrBayes version 3.0 (Huelsenbeck and Ronquist, 2001). Default priors were used with the JTT model of evolution plus one gamma distribution (eight categories), one parameter for the proportion of invariant sites, and one set of branch lengths for the entire data set. This is the same model that is used for the PHYML + protein rates analysis of the data. Two runs of four chains with 300,000 iterations were performed; the burn-in used was 30,000. A further analysis of the data was performed without protein rates (using the same model) in order to compare to the concatenated PHYML analysis. Four chains were run for 150,000 iterations, with a burn-in of 15,000. Convergence of the chains was determined empirically.

Inclusion of DistR estimates into the phylogenetic tree search of PHYML
DistR rates were incorporated into the ML framework of PHYML following the proportional approach (Pupko et al., 2002b; Yang, 1996); however, optimization over the rates was not performed. ML trees over the entire data set were calculated in four different ways using this modified version of PHYML. In the first analysis, the proteins were simply concatenated (equivalent to a rate of one for each protein). In the second analysis, the estimated protein rates from the real data set (based on patristic ML distances) were used for each bootstrap replicate when computing the likelihood. In the third and fourth analyses, protein rates were estimated for each bootstrap replicate using patristic and pairwise ML distances respectively. These rates were incorporated into the likelihood computation for each bootstrap replicate. Consensus trees were computed using the CONSENSE program available in the PHYLIP package (Felsenstein, 2004b).
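Under the proportional approach, every protein shares one tree topology and one set of branch lengths, and each protein's likelihood is evaluated with those branch lengths multiplied by its rate. The sketch below shows only this bookkeeping. The likelihood itself is left as a caller-supplied function, here named `log_likelihood_fn`, which is a hypothetical placeholder and not part of PHYML.

```python
def scale_branches(branch_lengths, rate):
    """Branch lengths seen by one protein: the shared lengths times its rate."""
    return {branch: rate * length for branch, length in branch_lengths.items()}


def combined_log_likelihood(alignments, rates, branch_lengths, log_likelihood_fn):
    """Sum per-protein log-likelihoods, each evaluated on its rescaled tree.

    `log_likelihood_fn(alignment, branch_lengths)` is a hypothetical stand-in
    for a real phylogenetic likelihood computation.
    """
    return sum(log_likelihood_fn(aln, scale_branches(branch_lengths, r))
               for aln, r in zip(alignments, rates))
```

Setting every rate to one reduces this to the concatenated analysis, since all proteins then see the same branch lengths.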


    Results and Discussion
Simulated Data
Patristic versus pairwise ML distances
The first simulation study demonstrates two important results: (1) pairwise ML distances serve the DistR method as well as patristic ML distances do (Fig. 3); and (2) if the initial pairwise or patristic ML distances fit the true distances well, then the DistR estimates will be accurate (Figs. 3 and 4). The first result is important since pairwise ML distances are very fast to compute. The second result indicates that error in the rate estimates stems principally from error in the distance estimates, rather than the DistR method itself.


Figure 3 Mean squared error for different methods of distance estimation and different alignment lengths. The rates at which the data were simulated are labeled on the left-hand side of the graph. The mean rate estimate for a given distance estimation method, alignment length, and rate is given on the right of the MSE bar. AL = alignment length. The 10 fastest proteins are in the left-hand column. The number of species in each protein (from fastest to slowest) are Protein 1: 53 species; Protein 2: 38 species; Protein 3: 33 species; Protein 4: 53 species; Protein 5: 38 species; Protein 6: 48 species; Protein 7: 53 species; Protein 8: 48 species; Protein 9: 43 species; Protein 10: 33 species. The 10 slowest proteins are in the right-hand column. The number of species in each protein (from fastest to slowest) are Protein 1: 33 species; Protein 2: 48 species; Protein 3: 43 species; Protein 4: 43 species; Protein 5: 48 species; Protein 6: 33 species; Protein 7: 43 species; Protein 8: 53 species; Protein 9: 38 species; Protein 10: 38 species. All rates are normalized so that the average rate is one over all 20 proteins. The total number of taxa in the data set is 58.

 


Figure 4 Average error of DistR rate estimates compared to goodness-of-fit of distances based upon patristic and pairwise ML distance estimates. (a) DistR rate estimates were based upon simulated proteins of length 100. (b) DistR rate estimates were based upon simulated proteins of length 300. A higher value for goodness-of-fit means that the fit of the estimated distances to the original distances is better.

 
The numerical results from the first experiment are summarized in Figure 3. The proteins are sorted in order of increasing rate, and the histogram indicates the mean squared error (MSE) over the 10 different replicates (see Appendix 1 for the exact formula used to compute MSE). Mean rate estimates are labelled to the right of each MSE bar, with the rate at which the data was simulated on the left. Results are presented only for alignments of length 100 and 1000. The results for alignments of length 300 and 500 fall in-between these two extremes. Note that the MSE increases in proportion to the rate, so results are presented on two scales.

The mean estimates for the different methods were quite close to the real rates at which the data were simulated, regardless of the alignment length, procedure used to estimate the distances, or rate at which the data was simulated (Fig. 3). However, it is clear from the mean squared error that the DistR estimates based on shorter alignments have larger error (or greater variation), despite the fact that the mean rate estimate is often almost as accurate as that for longer alignments. Furthermore, the mean squared error tends to increase with higher rates. This is likely because the error is often in the third significant digit; for slower rates this will lead to a smaller MSE. Overall there is negligible difference between the mean and MSE statistics for a given alignment length (comparing DistR estimates based on patristic versus pairwise ML distances).
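The observation that the mean estimate can be accurate while the MSE is large follows from the standard decomposition of MSE into squared bias plus variance. The sketch below illustrates this with invented replicate values; the exact MSE formula used in the study is given in Appendix 1 and is not reproduced here.

```python
from statistics import fmean, pvariance


def mse(estimates, true_value):
    """Mean squared error of replicate estimates around the true value."""
    return fmean((e - true_value) ** 2 for e in estimates)


def bias_and_variance(estimates, true_value):
    """Bias of the mean estimate and population variance across replicates."""
    return fmean(estimates) - true_value, pvariance(estimates)


# Hypothetical replicate estimates of a protein simulated at rate 2.0.
true_rate = 2.0
short_alignment = [1.6, 2.4, 1.8, 2.2, 2.0]      # noisier estimates
long_alignment = [1.95, 2.05, 1.98, 2.02, 2.0]   # tighter estimates

mse_short = mse(short_alignment, true_rate)
mse_long = mse(long_alignment, true_rate)
bias_short, var_short = bias_and_variance(short_alignment, true_rate)
```

Both sets of replicates have a mean equal to the true rate, so the difference in MSE is entirely due to variance.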

Results also indicate that errors in the rate estimates are due to errors in the original distances rather than approximations introduced in the DistR method. For each protein and alignment length the absolute error between the mean rate estimates and the real rate at which the alignments were simulated was compared to the goodness-of-fit between the estimated and true distances (Fig. 4). This fit can be measured since the data are simulated under a known model at a particular rate. Only alignments of length 100 and 300 were examined, since the errors become negligible for longer alignments. The fit was measured using the goodness-of-fit statistic of Tanaka and Huba (1985), which is determined from the sum of squares error between true and estimated distances, normalized by the sum of the true distances squared. The exact formula for goodness-of-fit is presented in Appendix 1. The statistic has a maximum of one, which indicates a perfect fit.
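One form of the statistic that is consistent with the verbal description above is one minus the normalized sum of squared errors. The sketch below implements that reading; it is an assumption, since the exact formula is given only in Appendix 1, and the distances are invented for illustration.

```python
def goodness_of_fit(true_dists, est_dists):
    """One minus the squared error normalized by the sum of squared true distances."""
    sse = sum((true_dists[pair] - est_dists[pair]) ** 2 for pair in true_dists)
    norm = sum(d ** 2 for d in true_dists.values())
    return 1.0 - sse / norm


# Hypothetical true and estimated distances for three taxon pairs.
true_dists = {("A", "B"): 0.10, ("A", "C"): 0.30, ("B", "C"): 0.25}
good_estimate = {("A", "B"): 0.11, ("A", "C"): 0.29, ("B", "C"): 0.25}
poor_estimate = {("A", "B"): 0.20, ("A", "C"): 0.20, ("B", "C"): 0.35}

gof_good = goodness_of_fit(true_dists, good_estimate)
gof_poor = goodness_of_fit(true_dists, poor_estimate)
```

Under this form a perfect match scores exactly one, and larger discrepancies push the value down.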

It is expected that with longer alignments the goodness-of-fit will increase, indicating that the fit of the model to the data is better. This is clearly the case when comparing goodness-of-fit for alignments of length 100 (Fig. 4a) to that for alignments of length 300 (Fig. 4b). The fit is further improved, and relative error reduced, with alignments of length 500 and longer (data not shown). The decrease in the goodness-of-fit (indicating a worse fit) seen with short alignment lengths indicates that the error of the method is dependent upon the error of the distance estimates and is not a property of the estimation procedure itself.

Interestingly, the error in rate estimation is in some cases less when based upon pairwise ML distances, rather than patristic ML distances. Given that the multiple sequence alignments are short (100 and 300 amino acid residues) and include many species (at least 33 in each protein alignment), there are many trees that will fit the data equally well. Thus, there is high variation in building a ML tree to fit the original tree on which the data were simulated. Hence, estimating a ML tree with few data will likely lead to an incorrect topology. This will result in a worse fit between the original tree and the tree estimated from the alignment data. This is not true for pairwise ML distances, which do not account for topology.

Missing distances between taxa
In the previous experiment, less than half of the taxa were missing in each protein, and 20 proteins were used to estimate rates. The effects of more extreme missing taxa were also tested, where no distance estimates were present between some pairs of taxa. To achieve this, up to 75% of the taxa were removed from the starting tree. Additionally, many fewer proteins were used for DistR estimation. Results indicate that the DistR method is robust to missing taxa, though having many missing taxa led to the expected increase in variance of the rate estimates.

Figure 5 summarizes the error in rate estimates for two simulated data sets. In the first example (Fig. 5a) there are four protein trees, each with 16 taxa (≈ 28% of the total taxon set). In the second example (Fig. 5b) there are eight protein trees. Seven of these have 16 taxa and the other has 30 taxa. The proteins are ordered from fastest to slowest rate in both Figures 5a and 5b. Mean rate estimates are shown on the right of the MSE, and the rate at which the protein was simulated (normalized so that the average rate is one) is given on the left. Simulated proteins in Figure 5a are labeled from I to IV. The same simulated proteins in Figure 5b are likewise labeled.


Figure 5 Mean squared error for different methods and different amounts of distance data. The rates at which the data were simulated are labelled on the left-hand side of the graph in both (a) and (b). Mean rate estimates for both distance estimation methods are labelled on the right of the MSE bars for each protein. All rates are normalized so that the average rate is one in both (a) and (b) and are sorted from fastest to slowest. Proteins that are the same in both (a) and (b) are labelled. (a) Rate estimates based upon a data set consisting of four proteins with 16 taxa each. (b) Rate estimates based upon a data set consisting of eight proteins; seven with 16 taxa and one with 30 taxa.

 
Once again it is evident that pairwise ML distances and patristic ML distances give almost identical average relative rate estimates (to within two or three decimal places). Furthermore, the missing data has little effect on mean rate estimates, but does have a large effect on the variance. For instance, comparing the MSE for the first protein in Figure 5a to that of the second protein in Figure 5b (it is the same simulated protein), it is clear that although the mean rate estimate is approximately as accurate with more taxa (Fig. 5b), the MSE is clearly smaller when more distances between a pair of taxa are included in the analysis. Thus it is evident that more data in terms of pairwise distances between taxa (over multiple proteins) will reduce the error of the DistR estimate.

Calculation of the relative rates within groups of the same number of species was also performed (i.e., proteins with 16 species, proteins with 30 species, and proteins with 44 species). For each subset of proteins, mean rate estimates based on pairwise ML distances were slightly worse than or identical to those based on patristic ML distances (data not shown). In addition, the variances were in general greater for rates estimated from pairwise ML distances. The major difference between the three analyses was that the variance of the rate estimates was lower when more species were included in the analysis. Furthermore, the mean rate estimates were slightly more accurate for the data sets over larger taxon groups (data not shown).

Accuracy in spite of missing taxa demonstrates that the rate estimation procedure is consistent (assuming that the initial distance estimates are accurate), regardless of the number of proteins under analysis. This is because rates are not computed relative to the distance estimates of one protein. Rather, they are constrained by all the distance estimates. Thus, if one set of distance estimates is extremely biased with respect to the remainder of the distances, it will not have a strong effect on the final rate estimates.

Empirical Data
Comparison of DistR estimates to ML estimates
Rates were calculated in an ML framework using only those proteins that are present over the entire species set (Atp6, Cob, Cox1, Cox2, and Cox3) due to a constraint of the program COMBINE (Pupko et al., 2002b). Table 3 shows the time for rate estimation and rate estimates based on different models under the ML framework in comparison to DistR estimates based on pairwise and patristic ML distances. Two sets of ML estimates are given for each model: the first is based upon the concatenated tree, and the second upon the DistR-incorporated ML tree. DistR estimates are computed far more rapidly and are still accurate in comparison to ML estimates. In comparison to the six ML estimates, the DistR rates based on patristic ML distances are slight overestimates for Cob and Cox1, and slight underestimates for Cox2 and Cox3. The estimate for Atp6 is an average of the six ML estimates (Table 3). Notably, the patristic DistR estimates for Cob and Cox1 are closest to the ML estimates based on the rate-incorporated tree using global amino acid frequencies plus the one-gamma-distribution model. Conversely, the DistR estimates for Cox2 and Cox3 are closest to the ML estimates based on the same tree, using local amino acid frequencies and the five-gamma-distribution model. The DistR estimates based on pairwise ML distances are quite close to those based on patristic ML distances, except for Atp6 and Cox3. The pairwise estimate for Atp6 is much higher, quite close to the ML estimate for the LF + 5-GAM model where the estimates were based on the rate-incorporated ML tree. However, the Cox3 estimate is quite low compared to all ML estimates; Cox3 had a higher variation in rate estimation over all proteins (Table 3), a case where perhaps the lack of topological information decreases the accuracy of the DistR estimate. Clearly this is not an issue for most proteins, but it can be an issue for some.
Overall, it appears that the DistR estimates are model independent regardless of distance estimation procedure and provide excellent first approximations to the ML estimates.


Table 3 Comparison of ML rate estimates to DistR estimates. Comparison of relative rate estimates and estimation time from COMBINE and DistR for five proteins (Atp6, Cob, Cox1, Cox2, and Cox3) from the fungal data set. For each model, rates based upon the maximum likelihood concatenated tree from PHYML are given on the first line, and rates based upon the maximum likelihood tree incorporating DistR rates (computed in PHYML) are given on the second. All estimates were normalized so that the average rate is one. GF = global amino acid frequencies; LF = local amino acid frequencies (calculated for each protein); 1-GAM = one gamma distribution estimated for the entire data set; 5-GAM = one gamma distribution for each protein; DistR Pat = DistR estimation using patristic ML distances; DistR Pair = DistR estimation using pairwise ML distances.

 
Comparison of DistR estimates to Bayesian estimates
The posterior distribution of rates from MrBayes is shown in Figure 6. For all but three of the proteins the DistR estimates fall within the 95% posterior credible interval for the protein rate. Nad6, Cox1, and Cox3 each have DistR estimates that fall outside the 95% posterior credible interval. Both Cox1 and Cox3 have average sequence lengths, and 29 taxa each. Nad6 is shorter at less than 100 amino acids, with only 24 species. In the case of Nad6, perhaps the short sequence length contributes to uncertainty in the DistR estimates. However, it is unlikely that the Bayesian posterior distributions of the rates are accurate. This conclusion is based upon the fact that the four chains were mixing quite poorly in both runs even after 300,000 iterations (data not shown). Sampling from the posterior distribution is unlikely to be correct since the chain might be oversampling from areas of low likelihood. Comparison of the tree of the highest likelihood from this analysis to the tree of highest likelihood based on the concatenated data indicates that MrBayes was in a suboptimal topological space when sampling rate estimates (using the Bayesian information criterion, data not shown). Furthermore, the DistR ML tree is a significantly better fit of the model to the data based on the AIC (Felsenstein, 2004a) when compared to the likelihood of the MrBayes rate-incorporated tree as computed in PHYML. Thus, although the posterior distribution of the rates appears reasonable, the chain seems to be having difficulty sampling through topology space.
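
As an illustration of the comparison described above, the following minimal sketch computes an equal-tailed 95% credible interval from posterior samples and checks whether a point estimate falls inside it. The function names and the simulated posterior sample are hypothetical; they are not MrBayes output and do not reproduce the values in Figure 6.

```python
import numpy as np

def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from a set of posterior samples."""
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(np.asarray(samples, dtype=float), [tail, 1.0 - tail])
    return float(lo), float(hi)

def within_interval(estimate, samples, level=0.95):
    """True if a point estimate lies inside the credible interval."""
    lo, hi = credible_interval(samples, level)
    return lo <= estimate <= hi

# Hypothetical posterior sample of one protein's relative rate (illustration only).
rng = np.random.default_rng(0)
posterior = rng.normal(loc=1.2, scale=0.05, size=5000)

print(credible_interval(posterior))
print(within_interval(1.22, posterior), within_interval(1.60, posterior))
```

An interval computed this way is only as reliable as the sampled chain, which is the concern raised in the text about poor mixing.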


Figure 6 Distribution of rates from the MrBayes proportional model analysis compared to DistR estimates. Bars at either end represent the 95% credible interval. The DistR estimate based upon patristic ML distances is marked by a solid triangle. The DistR estimate based upon pairwise ML distances is marked by a square. The posterior rate estimates of MrBayes are given by a solid square. DistR estimates are normalized so that the average rate is one (as in MrBayes). Proteins are ordered from shortest to longest as follows: Atp8, Atp9, Rps3, Nad3, Nad4, Nad4L, Nad6, Atp6, Cox2, Cox3, Nad1, Nad2, Cob, Nad4, Cox1, and Nad5.

 
Thus, it appears that the proportional model under MrBayes, when used without different parameters for each partition (as in Nylander et al., 2004), does not search tree space as well as PHYML with the rate multipliers included. Perhaps this is due to an incorrect prior on the rate parameters used. If this is the problem the DistR method can certainly be used to find a distribution of the rates of proteins, which could be used as the prior on these parameters. The discrepancy could also be due to the different search heuristics used in MrBayes. Given the computational complexity of the search, it might be difficult for the program to search for the best rate parameters while also searching for the best topology.

Patristic versus pairwise ML distances
The relative protein rates of the real data are unknown. However, the variance of the rate estimates using both patristic and pairwise ML distances can be compared, a smaller variance being preferable. Contrary to expectations, but confirming the simulation studies, rate estimates from pairwise ML distances had smaller variance than rate estimates from patristic ML distances.

Variances of the rate values computed were estimated by nonparametric bootstrap of the protein alignments, and reestimation of the distances and DistR rates for each bootstrap data set. The mean and variance of the DistR estimates for pairwise and patristic ML distances show some interesting trends (Table 2). In general, the average rate estimates were similar, with the notable exceptions of Atp8, Cox3, and Rps3 (and to a lesser extent Nad2, Nad5, and Nad6). Ten of the 15 protein rates derived from patristic ML distances had greater variance than their counterparts derived from pairwise ML distances (Table 2). These results support the conclusion that introducing topology into the distance estimation procedure is not likely to lead to better distance estimates for the DistR procedure when so many taxa are involved and the alignments are short. This is a consequence of the large number of distinct trees that can fit a short alignment equally well.
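
The bootstrap procedure described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `estimate` callable is a hypothetical stand-in for the full chain of distance estimation followed by DistR, and the toy alignment and the p-distance statistic are invented for the example.

```python
import random
from statistics import mean, variance

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (nonparametric bootstrap).

    alignment: dict mapping taxon name -> aligned sequence (all of equal length).
    """
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[c] for c in cols) for taxon, seq in alignment.items()}

def bootstrap_mean_variance(alignment, estimate, replicates=100, seed=0):
    """Mean and sample variance of `estimate` over bootstrap replicates."""
    rng = random.Random(seed)
    values = [estimate(bootstrap_alignment(alignment, rng)) for _ in range(replicates)]
    return mean(values), variance(values)

def p_distance(aln):
    """Toy statistic: proportion of differing sites between the first two sequences."""
    a, b = list(aln.values())[:2]
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical two-taxon alignment, for illustration only.
toy = {"taxonA": "MKVLAAGIVG", "taxonB": "MKILAAGLVG"}
m, v = bootstrap_mean_variance(toy, p_distance, replicates=200)
print(m, v)
```

In a real analysis the statistic would be the DistR rate of each protein, recomputed from distances reestimated on every resampled alignment.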

Inclusion of DistR estimates into phylogenetic tree search of PHYML
The experimental results when DistR estimates are incorporated into the ML tree search demonstrate the importance of accounting for different evolutionary pressures in phylogenetic inference.

Bootstrap support values for the ML tree using concatenated data are presented in Figure 7a. The bootstrap support for some of the clades was quite weak. Incorporating DistR estimates based upon both patristic and pairwise ML distances into the tree search led to the same ML tree, presented in Figure 7b. Overall, bootstrap support was improved in most clades when DistR estimates were incorporated into the tree search.


Figure 7 (a) Phylogenetic analysis based upon the mitochondrial data set. The topology shown was inferred using PHYML without DistR protein rates, using the JTT model of protein evolution, with eight gamma categories, and ML estimation of the alpha parameter of the gamma distribution and the proportion of invariant sites. It was constructed using the concatenated "unambiguously" aligned proteins. Bootstrap support for this topology was computed based upon 100 replicates. The percentage of support for each clade is given at the root of the clade. In cases where the consensus tree differed from the maximum likelihood topology a "–" is written. (b) Phylogenetic analysis based upon mitochondrial data set. The topology shown was inferred using PHYML with DistR protein rates, using the JTT model of protein evolution, with eight gamma categories, and ML estimation of the alpha parameter of the gamma distribution and the proportion of invariant sites. It was constructed using the concatenated unambiguously aligned proteins and protein rate estimates. The percentage of support for each clade is given. Bootstrap support for this topology was computed based upon 100 replicates, using three different methods. The top numbers give the percentage of support based upon using the patristic ML distance DistR estimates from the real data as rate values in computing the ML tree for each bootstrap replicate. The middle numbers give the percentage of support based upon reestimating DistR estimates for each bootstrap replicate using patristic ML distances. The bottom numbers give the percentage of support based upon reestimating DistR estimates for each bootstrap replicate using pairwise ML distances. When bootstrap support was the same for each method of incorporating rates it is given only once.

 
The topology of the ML concatenation-based tree does not separate Zygomycota and Ascomycota as distinct clades, which is not surprising because the Zygomycota are traditionally difficult to place. Furthermore, the outgroup is incorrect since it should also contain Homo sapiens (which groups incorrectly with the zygomycete Smittium and the Ascomycota). This long-branch-attraction problem is due to the highly derived Smittium and Homo sequences. Using DistR estimates improves the bootstrap support in certain clades, and corrects the most evident topological problems, notably that Zygomycota more accurately group together (although as an unresolved paraphyletic group). Indeed, almost every branch that does not show 100% bootstrap support with the concatenated data has improved support when protein rates are used. The only branching for which support lessened somewhat from the concatenated to the protein-rate-based trees (including when individual bootstrap rates were used) was the branching of Allomyces (a species that is difficult to place whatever the method or data set) with the remainder of the Chytridiomycota (Figs. 7a and b). Bootstrap support is strongest when using protein rates based upon pairwise ML distances, where the rate estimates were recomputed for each bootstrap replicate. This is perhaps because the variation in the pairwise ML distance rate estimates was smaller than, or on the same order of magnitude as, that of the rate estimates based on patristic ML distances.

Both the Kishino-Hasegawa (KH) test and the Akaike Information Criterion (AIC) support the ML topology with protein rates as a better fit of the model to the data than the concatenated topology. Under the KH test (Kishino and Hasegawa, 1989; Shimodaira and Hasegawa, 2001), the concatenated topology was significantly worse than the DistR topology (P < 0.0001) when the topology was computed with rate estimates calculated based on both patristic and pairwise ML distances. The AIC provides a statistical measurement of the significance of the change in log-likelihood when using two different models to fit the data. The measure compensates for the increase in the number of parameters in the rates model. When DistR estimates based on pairwise ML distances are used, the AIC is 1043.65182 greater than the AIC for a single-rate, concatenated analysis. When patristic ML distances are used for rate estimation, the increase in AIC over the concatenated analysis is 1068.7542. Both increases in AIC are very substantial, indicating that important information in the data that is disregarded by traditional concatenated analysis is captured by modeling protein rates.
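
A small sketch of how such an AIC comparison can be computed is given here. It uses the standard definition AIC = 2k − 2 ln L, under which lower values indicate better fit, and reports the improvement as the AIC of the simpler model minus that of the richer model. The text above reports increases in AIC in favour of the rates model, which suggests a sign convention in which larger is better; reading it that way is an assumption. The log-likelihoods and parameter counts below are hypothetical and are not the values from this study.

```python
def aic(log_likelihood, n_params):
    """Standard Akaike Information Criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def aic_improvement(lnl_simple, k_simple, lnl_rich, k_rich):
    """Positive when the richer model is preferred after the parameter penalty."""
    return aic(lnl_simple, k_simple) - aic(lnl_rich, k_rich)

# Hypothetical log-likelihoods and parameter counts, for illustration only.
gain = aic_improvement(lnl_simple=-61000.0, k_simple=60,
                       lnl_rich=-60450.0, k_rich=74)
print(gain)  # 2 * 550 - 2 * 14 = 1072.0
```

The penalty term is what compensates for the extra rate parameters: a gain in log-likelihood counts only to the extent that it exceeds the number of added parameters.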


    Conclusion
 
A fast and accurate method to calculate the rates of partitioned data sets is presented. Although the analyses performed here are based upon protein sequence data, using nucleotide sequences should prove as effective. The error in the method is largely due to incorrect initial distance estimates for the proteins, which tend to be worse with smaller or poorly conserved sequences. Using pairwise ML distances for DistR estimation is just as accurate as using patristic ML distances. The estimates are accurate when compared to ML estimates and Bayesian posterior credible intervals for the rates. Incorporating the DistR estimates into PHYML leads to statistically better likelihood and topology.


    Appendix 1. Formula for Mean Squared Error and Goodness-of-Fit
 
Mean squared error is used to describe the accuracy of rate estimates. Because only relative rates can be computed, rates are normalized so that the average rate over all proteins is one. Let $r$ denote the true rate (that is, the rate used in simulations), and let $\hat{r}_1, \ldots, \hat{r}_{10}$ be the rates estimated in the 10 replicates of the experiment. The mean squared error (MSE) is defined as


$$\mathrm{MSE} = \frac{1}{10} \sum_{i=1}^{10} \left(\hat{r}_i - r\right)^2$$
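
The normalization and error measure just described can be sketched in a few lines. The sketch takes the MSE to be the average squared deviation of the replicate estimates from the true rate, which is the usual definition and is assumed here; the function names and the numbers are hypothetical.

```python
def normalize_rates(rates):
    """Rescale relative rates so that their average is one."""
    avg = sum(rates) / len(rates)
    return [r / avg for r in rates]

def mean_squared_error(true_rate, estimates):
    """Average squared deviation of replicate estimates from the true rate."""
    return sum((e - true_rate) ** 2 for e in estimates) / len(estimates)

# Hypothetical raw rates for four proteins and replicate estimates for one of them.
print(normalize_rates([4.0, 2.0, 1.0, 1.0]))
print(mean_squared_error(1.0, [0.9, 1.1, 1.0, 1.2]))
```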

Goodness-of-fit is used to measure the fit of the distance estimates to the distances in the tree used for simulation. There is a slight problem with scales since PSeq-Gen treats branch lengths as the expected number of substitutions per 100 sites while PHYML treats branch lengths as the expected number of substitutions per site. Let $\tilde{d}^{(k)}_{xy}$ be the distance between x and y in the tree used to simulate protein k, let $r_k$ denote the rate used when simulating protein k, and let $\hat{d}^{(k)}_{xy}$ be the distance estimated by PHYML.

Given the differences in scale the goodness-of-fit measure used was


Formula

Note that the goodness-of-fit is at most one, and equals one if and only if there is a perfect fit.


    Appendix 2. Fast Algorithm for Least-Squares Estimation
 
This appendix shows how to quickly determine the vectors p and r that minimize the function q(p, r) in Equation (3)


Formula

subject to the constraint that h(p) = κ, where


Formula

and κ is an arbitrary, positive constant. In the implementation of DistR


Formula

which corresponds to the assumption that the unknown consensus distances are roughly centered on the average of the observed distances. This value can be computed in O(nm²) time for n proteins and m taxa. Any other positive constant will work, as the only effect is to change the scale of the rate estimates.

To simplify the mathematics substitute Formula for each k = 1,...,n. Let s denote the vector [s1, ...,sn]T. Minimizing q(p, r) is then equivalent to minimizing


Formula 5

(5)

Recall from calculus that the minimum of a one dimensional function can be found by determining where the first derivative is equal to zero. This condition extends to multidimensional functions with constraints. Refer to Gill et al. (1982) for an excellent introduction to the optimization tools used here.

If (p, s) together minimize the function f, subject to the condition that h(p) = κ, then there exists a real number λ such that


Formula 6

(6)

In general, (6) is only a necessary condition for reaching the minimum, and not a sufficient condition. However, in this case the matrix formed from the second derivatives of f(p, s) is positive definite, so that the function f is convex (Gill et al., 1982). It follows that if (p, s) and λ satisfy (6) then (p, s) gives the global minimum.

It is possible to derive the partial derivatives of the functions f and h explicitly. To help with notation define the quantities:


Formula

The partial derivative of f with respect to sk, for some protein k, is


Formula

The partial derivatives of f and h with respect to pxy, for some taxa x,y, are


Formula

Note from the partial derivatives that the conditions in Equation (6) are linear equations involving the entries of p, s, and λ. As such, the next step is to rewrite (6) in terms of matrix algebra. Given that there are n proteins and m taxa define the following: let D be the n × n matrix with α1, α2, ..., αn down the diagonal and zeros off the diagonal; let C be the Formula matrix with β12, β13, ..., β(m–1)m down the diagonal and zeros off the diagonal; let B be the Formula matrix with rows indexed by unique pairs of taxa, columns indexed by proteins, and the entry corresponding to row xy and column k equal to βxy,k; let v be the Formula dimensional vector Formula .

The conditions in Equation (6) can now be rewritten as


Formula 7

(7)


Formula 8

(8)


Formula 9

(9)
Define


Formula

Solving for p in (8) gives:


Formula 10

(10)
Substituting this into (9) and solving for λ gives:


Formula

Replacing λ with the above expression in (10) provides a solution for p in terms of the above defined matrices, vectors, and s (i.e., there are no longer any unknowns except for p and s):


Formula 11

(11)


Formula 12

(12)
Finally, substitute (12) into (7) to get


Formula

Let


Formula

Then, s is found by solving the equation:


Formula 13

(13)
Consensus distances p are obtained by substituting s into Equation (12).

The entire computation is summarized in Appendix 3. The running time of the algorithm is O(nm² + n³), which is time optimal. The algorithm uses O(n² + m²) memory in addition to the O(nm²) required to store the distance estimates Formula .

There are two complications that can arise in the above calculations. Firstly, it could be the case that for a particular pair of taxa x,y there is no single protein that contains both x and y. This means that βxy is undefined, so that C is no longer invertible. This problem is easily solved. If there is no protein with both x and y then the line in (6) involving the partial derivative with respect to pxy is satisfied trivially. Therefore, the row and column of C, the row of B, and entry of v indexed by the pair x,y can be removed. The reduced problem can be solved as before, although no estimate for pxy is obtained. Row removal is handled in the pseudocode for the algorithm given in Appendix 3 by using constraints in the summations.

The second complication is that the optimization problem might have more than one solution, in which case the matrix M in (13) will not be invertible. This indicates that more information is required to estimate the relative rates, as would arise, for example, in a concatenation of two protein alignments over entirely different sets of taxa.
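
Because the formulas of this appendix are not reproduced above, the following is only a sketch of a constrained least-squares solve of this general shape, written under explicitly stated assumptions; it is not the authors' DistR implementation. It assumes an unweighted objective f(p, s) = Σ_k Σ_xy (s_k d_xy^(k) − p_xy)², the substitution s_k = 1/r_k, the constraint h(p) = Σ_xy p_xy = κ, and a default κ equal to the sum over taxon pairs of the mean observed distance. Under those assumptions D holds the sums of squared distances per protein, C counts the proteins containing each pair, B holds the distances themselves, and v is a vector of ones. Pairs of taxa that occur in no protein are dropped from the system, as described for the first complication.

```python
import numpy as np

def distr_sketch(dist, kappa=None):
    """Least-squares relative rates from per-protein distances (assumed objective).

    dist: array of shape (n_proteins, n_pairs); np.nan marks a taxon pair that is
    absent from a protein. Returns (rates normalized to mean one, consensus p).
    """
    dist = np.asarray(dist, dtype=float)
    present = ~np.isnan(dist)
    d = np.where(present, dist, 0.0)
    covered = present.any(axis=0)            # pairs seen in at least one protein
    d, present = d[:, covered], present[:, covered]

    alpha = (d ** 2).sum(axis=1)             # diagonal of D (one entry per protein)
    c = present.sum(axis=0).astype(float)    # diagonal of C (one entry per pair)
    B = d.T                                  # rows: taxon pairs, columns: proteins
    if kappa is None:                        # assumed default scale
        kappa = (d.sum(axis=0) / c).sum()

    cinv_v = 1.0 / c                         # C^{-1} v, with v a vector of ones
    u = B.T @ cinv_v                         # B^T C^{-1} v
    denom = cinv_v.sum()                     # v^T C^{-1} v
    M = np.diag(alpha) - B.T @ (B / c[:, None]) + np.outer(u, u) / denom
    s = np.linalg.solve(M, kappa * u / denom)  # raises LinAlgError if M is singular

    lam = (kappa - cinv_v @ (B @ s)) / denom   # Lagrange multiplier
    p = np.full(covered.shape, np.nan)         # nan where no estimate is obtained
    p[covered] = (B @ s + lam) / c
    rates = 1.0 / s
    return rates / rates.mean(), p

# Hypothetical noise-free data: 3 proteins, 4 taxa (6 pairs), protein 3 lacks one taxon.
p_star = np.array([0.10, 0.20, 0.30, 0.25, 0.35, 0.15])
true_rates = np.array([0.5, 1.0, 1.5])
dist = np.outer(true_rates, p_star)
dist[2, [2, 4, 5]] = np.nan
rates, p = distr_sketch(dist)
print(rates)
```

With noise-free input such as this, the sketch should recover the simulated relative rates despite the missing taxon. A singular matrix M, the second complication above, surfaces as a `numpy.linalg.LinAlgError`.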


    Appendix 3. The DistR Algorithm
 
Formula


    Acknowledgements
 
We thank Scott Bunnell, Alain Vandal, Tad Pupko, Tim Collins, and Olivier Gascuel for helpful comments on the manuscript. Thanks to Stéphane Guindon for kindly providing the source code of PHYML v2.2 for our use. Salary and support from the Canadian Institutes of Health Research (MOP 42475; BFL), the Canadian Institute for Advanced Research (CIAR; BFL), National Science and Engineering Research Council (NSERC grant 238975-01; DB), Fonds de recherche sur la nature et les technologies (FQRNT grant 2003-NC-81840; DB), and supply of laboratory equipment and informatics infrastructure by Genome Canada are gratefully acknowledged. RBB is supported by an NSERC PGS-B scholarship.


    References
 

    Bapteste E., Brinkmann H., Lee J. A., Moore D. V., Sensen C. W., Gordon P., Duruflé L., Gaasterland T., Lopez P., Müller M., Philippe H. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Nat. Acad. Sci. (2002) 99:1414–1419.

    Bull J., Huelsenbeck J. P., Cunningham C. W., Swofford D. L., Waddell P. J. Partitioning and combining data in phylogenetic analysis. Syst. Biol. (1993) 42:384–397.

    Bullerwell C. E., Forget L., Lang B. F. Evolution of monoblepharidalean fungi based on complete mitochondrial genome sequences. Nucleic Acids Res. (2003) 31:1614–1623.

    Bulmer M. Use of the method of generalized least squares in reconstructing phylogenies from sequence data. Mol. Biol. Evol. (1991) 8:868–883.

    Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. (2000) 17:540–552.

    Cranston K., Rannala B. Closing the gap between rocks and clocks. Heredity. (2005) 94:461–462.

    Eulenstein O., Chen D., Burleigh J. G., Fernández-Baca D., Sanderson M. J. Performance of flip supertree construction with a heuristic algorithm. Syst. Biol. (2004) 53:299–308.

    Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. (1978) 27:401–410.

    Felsenstein J. Taking variation of evolutionary rates between sites into account in inferring phylogenies. J. Mol. Evol. (2001) 53:447–455.

    Felsenstein J. Inferring phylogenies (2004a) Sunderland, Massachusetts: Sinauer Associates. pages 148–149.

    Felsenstein J. PHYLIP (Phylogeny Inference Package) version 3.6. (2004b) Seattle: University of Washington. Distributed by the author, Department of Genome Sciences. URL: http://evolution.genetics.washington.edu/phylip.html.

    Gill P., Murray W., Wright M. Practical optimization. (1982) Academic Press.

    Grassly N. C., Adachi J., Rambaut A. PSeq-Gen: An application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Comput. Appl. Biosci. (1997) 13:559–560.

    Guindon S., Gascuel O. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. (2003) 52:696–704.

    Huelsenbeck J. P., Bull J., Cunningham C. W. Combining data in phylogenetic analysis. Tree. (1996) 11:152–158.

    Huelsenbeck J. P., Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. (2001) 17:754–755.

    Kishino H., Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. (1989) 29:170–179.

    Lang B. F., O'Kelly C., Nerad T., Gray M. W., Burger G. The closest unicellular relatives of animals. Curr. Biol. (2002) 12:1773–1778.

    Lapointe F., Cucumel G. The average consensus procedure: Combination of weighted trees containing identical or overlapping sets of taxa. Syst. Biol. (1997) 46:306–312.

    Mayrose I., Graur D., Ben-Tal N., Pupko T. Comparison of site-specific rate-inference methods: Empirical Bayesian methods are superior. Mol. Biol. Evol. (2004) 21:1781–1791.

    Nylander J. A. A., Ronquist F., Huelsenbeck J. P., Nieves-Aldrey J. L. Bayesian phylogenetic analysis of combined data. Syst. Biol. (2004) 53:47–67.

    Olsen G. J., Pracht S., Overbeek R. DNArates. URL: http://geta.life.uiuc.edu/gary/programs/DNArates.html.

    Pupko T., Bell R., Mayrose I., Glaser F., Ben-Tal N. Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics. (2002a) 18:S71–S77.

    Pupko T., Huchon D., Cao Y., Okada N., Hasegawa M. Combining multiple data sets in a likelihood analysis: Which models are the best? Mol. Biol. Evol. (2002b) 19:2294–2307.

    Shimodaira H., Hasegawa M. CONSEL: For assessing the confidence of phylogenetic tree selection. Bioinformatics. (2001) 17:1246–1247.

    Sumida M., Kanamori Y., Kaneda H., Kato Y., Nishioka M., Hasegawa M., Yonekawa H. Complete nucleotide sequence and gene rearrangement of the mitochondrial genome of the Japanese pond frog Rana nigromaculata. Genes Genet. Systems. (2001) 76:311–325.

    Tanaka J. S., Huba G. J. A fit index for covariance structure models under arbitrary GLS estimation. Br. J. Math. Statist. Psych. (1985) 38:197–201.

    Thompson J. D., Higgins D. G., Gibson T. J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994) 22:4673–4680.

    Tomita K., Yokobori S., Oshima T., Ueda T., Watanabe K. The cephalopod Loligo bleekeri mitochondrial genome: Multiplied noncoding regions and transposition of tRNA genes. J. Mol. Evol. (2002) 54:486–500.

    Yang Z. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.

    Yang Z. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. (1996) 42:587–596.

