© 2006 Society of Systematic Biologists
Accurate Branch Length Estimation in Partitioned Bayesian Analyses Requires Accommodation of Among-Partition Rate Variation and Attention to Branch Length Priors
Edited by Allan Baker: Associate Editor
1 Department of Ecology and Evolutionary Biology, University of Connecticut 75 N. Eagleville Road, U-3043, Storrs, Connecticut 06269, USA E-mail: david.marshall{at}uconn.edu (D.C.M.) E-mail: chris.simon{at}uconn.edu (C.S.)
2 School of Biological Sciences, Victoria University of Wellington Wellington, New Zealand
3 Landcare Research Private Bag 92170, Auckland, New Zealand E-mail: BuckleyT{at}landcareresearch.co.nz
Received December 14, 2005; Revised March 1, 2006; Accepted September 5, 2006 Molecular phylogenetic studies are making increasing use of partitioned Bayesian analyses via software tools like MrBayes, version 3 (Ronquist and Huelsenbeck, 2003). Data partitioning is important because, as long as the same topology/history underlies all of the partitions, it addresses some of the problems associated with the combination of data sets with heterogeneous rates (Bull et al., 1993) and eliminates the need to argue the validity of tests that have been used to judge data combinability (e.g., Huelsenbeck et al., 1994; Huelsenbeck and Bull, 1996; Yang, 1996; Cunningham, 1997; Barker and Lutzoni, 2002; Buckley et al., 2002; Dowton and Austin, 2002). In addition, new studies indicate that data partitioning and the use of mixed models often dramatically improve the fit of model to data without the cost of overparameterization (Yang, 1996; Nylander et al., 2004; Brandley et al., 2005). While applying partitioned models to studies of protein-coding mitochondrial data, we have found that analyses using MrBayes may infer overly "long" trees if among-partition rate variation (APRV) is not explicitly accommodated and if data from different partitions evolve at different average rates. We have subsequently confirmed this bias in analyses of simulated datasets, as explained below. Because the erroneous tree lengths (TLs) can be nearly twice as high as those found when APRV is explicitly modeled, this finding should be of interest to those intending to apply previously calculated molecular clock rates to branch lengths estimated under partitioned models, as well as to those inferring evolutionary rates from fossil-calibrated evolutionary trees. Here we discuss the importance of among-partition rate variation and potential pitfalls in implementation of mixed models that accommodate APRV in Bayesian analyses. Note that, for this study, "APRV" refers to rate variation across a priori defined partitions, and a "no-APRV" analysis is one that defines partitions but does not explicitly accommodate among-partition rate variation in the model.
| Effect of Among-Partition Rate Variation on Estimates of Tree Length: An Example and a Simulation Study |
|---|
|
|
|---|
MrBayes 3 offers three ways for the user to model among-partition rate variation, all of which assume the same topology across partitions. The first option, prset ratepr = fixed, assumes that the average substitution rate (per site) of each partition is the same. The second option, prset ratepr = variable, adds "rate multiplier" parameters (m1, m2, mn) to the overall model that represent the relative average rates of the n partitions, one for each partition. The rates are allowed to vary subject to the constraint that the average rate of substitution across partitions is 1, and the rate multipliers are assigned a dirichlet(
number
, ...,
number
) prior. These multipliers are similar to the cg parameter of Yang (1996), except that in his method cg indicates the relative rate compared to that of the first partition, which is set to 1. The third MrBayes option, unlink brlens, allows relative branch lengths within the tree to vary across partitions—a form of heterotachy (e.g., Philippe et al., 2005). Note that adding gamma-distributed among-site rate variation (Yang, 1994b) to the model within each individual partition does not by itself accommodate rate variation across partitions: APRV is modeled by allowing the average rates of different partitions to vary.
Using data from a study of mostly protein-coding mitochondrial DNA (mtDNA) data, derived from a recent (Plio-Pleistocene age) radiation of New Zealand cicadas of the genus Kikihia (unpublished data), we found that estimates of total tree length are sensitive to underlying assumptions about APRV. Partitioned analyses that unlink branch lengths or assume partition-specific rate multipliers consistently return shorter trees than partitioned analyses that do not accommodate APRV. Table 1 (analyses A to D) compares the mean log-likelihood and model parameter estimates obtained for a set of representative analyses, one unpartitioned and several partitioned. Detailed methods are given in the Appendix; note that the model parameters were estimated separately for each partition in every analysis. The estimated TL for the no-APRV analysis (A) was nearly twice as long as the values obtained by the partitioned analyses that accommodated APRV (analyses B and C). Interestingly, the unpartitioned analysis (D) estimated about the same TL as the partitioned analyses with APRV. The changes in tree length across the three partitioned analyses were accompanied by substantial changes in other model parameters, most notably
, which determines the shape of the gamma distribution. Although the actual degree of rate variation across codon positions cannot be known with certainty, in this situation there is excellent reason to suspect substantial among-partition rate variation: The data are protein-coding, and third position sites tend to evolve much more rapidly than first or second position sites in such genes. This expectation is strengthened by the much larger number of variable third position sites in the Kikihia data set (Table 2).
|
|
Suspecting that failure to accommodate APRV was causing the longer TL values in runs such as analysis A, we reanalyzed the same mtDNA data set with first, second, and third codon position sites distributed approximately evenly among the three partitions, again without explicitly accommodating APRV in the model (Table 1, analysis E) and again while estimating model parameters separately for each partition. The mean estimate of tree length was, as expected, comparable to the TL found when APRV was accommodated in the codon-partitioned analyses. This result further indicates that higher TL estimates in no-APRV analyses are related to rate variation across partitions.
Although the above test with codon positions randomly assigned to data partitions strongly implicates failure to accommodate partition rate variation as the cause of the higher TL estimates, this inference is limited by the fact that we do not know the true evolutionary properties of the mtDNA data nor the true length of the Kikihia evolutionary tree. To address these limitations, we examined the relationship between tree length and degree of among-partition rate variation using three-partition datasets simulated on a known tree (Fig. 1) under three different levels of APRV (relative partition rates 1:1:1, 3:1:9, and 5:1:25 for treatments A, B, and C, respectively). Within each treatment, the 20 simulated data sets were each analyzed under three partitioned models differing only in the method of accommodation of APRV (no-APRV, rate-multiplier m, and unlinked branches), and the mean log-likelihood and model parameter estimates from the 20 analyses were averaged for each of these three models (Table 3). (Again, "no-APRV" here means that the model is partitioned but does not explicitly accommodate partition rate variation.) Details of the simulations and phylogenetic analyses are found in the Appendix.
|
|
The simulated data analyses displayed the same patterns as those observed with the empirical data sets. Tree length was accurately estimated by all partitioned analyses (with and without APRV accommodated) when the data sets were simulated under equal partition rates (Table 4, analyses F, G, H). TL was also accurately estimated when APRV was simulated in the data and accommodated in the analysis model (Analyses B2, B3, C2, C3). In contrast, TL was biased upward in no-APRV analyses (analyses J and M) by approximately 20% with relative partition rates of 3:1:9 and by 50% with simulated partition rates of 5:1:25 (Fig. 2). Values of
and
were poorly estimated only when APRV was simulated in the data but not accommodated in the model. When the data were simulated under varying partition rates, mean log-likelihoods were dramatically and equally improved by including the partition rate multiplier or unlinked branches in the model. The two methods of accommodating APRV performed about equally well in estimating tree length and other model parameter estimates.
|
|
Why should failure to accommodate among-partition rate variation lead to dramatically higher TL estimates? In a no-APRV analysis, MrBayes estimates a single mean per-site substitution rate (tree length, TL) for the entire data set (Ronquist and Huelsenbeck, 2003-1573), and therefore the same average rate for each partition. If the data in different partitions have evolved under different partition rates, the posterior estimate of TL will represent a compromise between these conflicting processes. But why do no-APRV analyses not converge upon the overall mean TL value as the best compromise? Higher-than-average TL estimates worsen the fit to the data in the slowly evolving partitions; lower-than-average values worsen the fit to the fast third partition. In our analyses slower-than-average sites (first partition, second partition) predominate, yet the inferred TL value in no-APRV analyses is always biased toward the average rate of the sites in the third (most rapidly evolving) partition.
We suggest that upwardly biased TL values in no-APRV analyses are caused by the asymmetrical distribution of variable sites across partitions and constraints on the ability of partition-specific parameters to accommodate the poor fit of TL to the data. When all partitions are assigned the same mean tree length, the log-likelihood estimate is more greatly penalized by the poor fit of TL to the faster partitions. This is because there is no way for many variable sites to be easily explained by a slow evolutionary rate. In contrast, the mismatch between mean TL and the sites in the slower-than-average partitions can be reduced by shifts in parameters that model the degree of among-site rate variation. This is possible because a partition with a comparatively small fraction of variable sites can be explained by both (1) a model combining a slow rate and a low level of ASRV and (2) a model combining a faster rate with a higher level of ASRV. This hypothesis fits the biases observed in the analyses of our simulated data sets. When APRV was simulated in the data but not accommodated in the phylogenetic analysis (Table 4, analyses J and M), the first and (especially) second partition alphas were strongly biased toward smaller than expected values, indicating a shift toward greater estimated among-site rate variation within those partitions. The changes in
were accompanied by upwardly biased estimates of the transition/transversion rate ratio
in the first and second partitions. (Presumably
was biased upwards because the datasets were initially simulated with a greater number of transitions.) Together, these changes indicate a model in which a small number of first and (especially) second partition sites are evolving extremely rapidly while the rest remain unchanged.
Partitioning by codon position without the prset ratepr = variable or unlink brlens options seems to be an especially poor model for estimating branch lengths in data sets with significant among-partition rate variation, such as protein-coding genes. This problem is perhaps most important in studies of closely related taxa, which can be expected to have comparatively few variable first and second codon-position sites, although more simulation work would be needed to determine what effect overall sequence divergence has on the phenomenon. Interestingly, it appears that even a single-partition analysis can outperform a no-APRV partitioned analysis in estimating TL as long as among-site rate variation is included: As noted earlier, the single-partition analysis of the empirical data set (Table 1, analysis D) found a tree length comparable to that observed in the partitioned analyses that accommodated APRV. Reducing the discrete gamma approximation to four rate categories did not change this result (unpublished data).
| Effect of Among-Partition Rate Variation on Model Selection and Nodal Support |
|---|
|
|
|---|
Accommodating among-partition rate variation dramatically improves the fit of partitioned analyses to datasets exhibiting APRV, by tens to hundreds of log-likelihood units (Tables 1 and 4). This may be an important consideration for those using the likelihood estimate to evaluate alternative candidate models as is done with the Bayes factor (see, e.g., Nylander et al., 2004; Brandley et al., 2005), although in our particular example the no-APRV model (Table 1, analysis A) still outperformed the unpartitioned model (analysis D) by this criterion ("2 ln Bayes Factor" = 12.3, or "very strong" support for the more complex model by the criteria of Kass and Raftery, 1995). The possibility of underestimating the potential gain of partitioned models is important because model selection can have a variety of effects on downstream data analyses. Model misspecification can result in the passage of clock tests even when data are non-clocklike (Huelsenbeck et al., 1997), can cause the mis-estimation of topology (Kuhner and Felsenstein, 1994; Sullivan et al., 1995), can decrease the accuracy and precision of model parameter estimation (Cunningham et al., 1998) and bootstrap support (Buckley et al., 2001; Buckley and Cunningham, 2002), and can strongly bias posterior probabilities (Lemmon and Moriarty, 2004). Furthermore, the effect of model misspecification on posterior probabilities is also influenced by branch lengths mis-estimation (Lemmon and Moriarty, 2004). This is especially important in deeper level phylogenies because error in branch length estimation increases when branches are longer (Buckley et al., 2001; Lemmon and Moriarty, 2004) or when APRV is not accommodated (this work).
In contrast to its effects on tree length estimates and log-likelihoods, inclusion of among-partition rate variation only minimally influences nodal support or topology in our analyses. In the analyses of simulated data, tree topology and relative branch lengths within the tree were unaffected by the analysis method (detailed results are obtainable from the authors upon request). Nodal support values were only minimally affected. Two of the three internal branches (Fig. 1, nodes 2 and 3) were supported by 100% posterior probabilities in every analysis. The remaining, shortest branch was recovered 100% of the time when APRV was included in the model, but was supported by slightly lower posterior probabilities in the two no-APRV analyses (Fig. 3). Treatment of APRV had only minor effects on topology and nodal support in our analyses of the more complex mtDNA data set (e.g., Fig. 4). It is striking that multiple solutions exist that are very similar in topology, relative branch lengths, and nodal support yet dramatically different in mean log-likelihood and in the total amount of evolution inferred.
|
|
| Instability of APRV Analyses Under Diffuse Branch Length Priors |
|---|
|
|
|---|
Although accommodating among-partition rate variation appears to be necessary for accurate branch length estimation, effective use of the rate multiplier or unlinked branches options appears to require close attention to the branch length prior. Specifically, as the branch length prior mean is increased, the partition rate multipliers and/or partition-specific TL values may shift to estimates that appear unrealistic for the data. Table 5 shows the posterior parameter estimates from a set of seven three-partition analyses of the Kikihia mtDNA data set that accommodated APRV by unlinking branch lengths. The general methods used were the same as those in the first set of empirical analyses (see Appendix) except that invariant sites were used only for the first partition and the few tRNA sites were distributed across the three partitions. Across the seven analyses the branch length prior mean was varied from 1.0 to 0.01 by changing the lambda (
) argument of the exponential distribution from 1 to 100 (prset brlenspr = unconstrained:exponential[
]). Note that the prior mean in MrBayes equals 1/
. Across this range of priors the partition-specific TL estimates (TL1, TL2, TL3) show an interesting pattern: At larger values of
(smaller branch length prior means), the relative partition-specific tree lengths appear consistent with typical protein-gene evolution—very slow second-position sites, very fast third position sites. As the branch length prior increases, overall TL increases, but largely due to changes in the first and second partition rates, so that by the time the prior mean reaches 1 (analysis Q,
= 1) the posterior TL estimates for these partitions greatly exceed the third-partition tree length. The high first-and second-partition TL estimates are accompanied by unusually small estimates of
, the same pattern observed for the partitioned analyses without APRV in Tables 1 and 4. Note that the branch length prior in analysis R is the default value for MrBayes 3.0b4. In our experience these shifts are sometimes, but not always, accompanied by signs of poor convergence such as unstable parameter estimates and low effective sample sizes when analyzed with Tracer v. 1.3 (Rambaut and Drummond, 2003). We observed similar sensitivity to the branch length prior in analyses that accommodated APRV by use of the partition rate multiplier (data not shown). These results indicate that incorporation of APRV into partitioned Bayesian analyses must be done with caution. Ideally, several MCMC analyses should be conducted to explore sensitivity of the partition rate multipliers or partition-specific tree lengths to the branch length prior.
|
| Nonidentifiability and Sensitivity of Tree Length to the Branch Length Prior |
|---|
|
|
|---|
Branch lengths were sensitive to the prior distribution in all analyses. Lower values of lambda led to longer trees, and vice versa (Table 5). Although this could mean that the branch lengths are nonidentifiable (Rannala, 2002), other factors indicate a more complex problem. Branch lengths vary considerably within our trees, yet relative branch lengths within the consensus phylograms appear insensitive to the choice of model or branch length prior across all analyses (e.g., Fig. 4). Only the total tree length changes in concert with the prior mean. Again, we find it striking that, depending on the partitioning scheme and accommodation of APRV, trees with nearly identical topology, relative branch lengths, and nodal support can be found with very different total tree lengths. (Note that Yang, 2005-82, has argued that consensus phylogram branch lengths from MrBayes 3 should be discarded in favor of new ones obtained by re-running the MCMC with the consensus topology fixed. In this paper we have chosen to use the MrBayes consensus TL values because these have been widely used in published Bayesian analyses and because we are not focusing on individual branch lengths.)
Our analyses did suggest lack of branch length "signal" in one particular situation: In the three-partition analyses with branch lengths unlinked (e.g., Table 1, analysis C), individual branches within the second-partition consensus tree were unusually similar in length (Fig. 5) compared to the branches in the overall consensus phylogram (Fig. 4a), creating a "ladderlike" appearance of the tree. Many of the branches in the second-partition phylogram have lengths that are similar to or slightly less than the mean of the branch length prior, suggesting that the prior is biasing these parameter estimates. Nonidentifiability of individual branch lengths in the slower-evolving partitions may help to explain why the analyses with unlinked branch lengths did not outperform the rate multiplier analyses in mean log-likelihood.
|
Although applying rate multipliers and unlinking branch lengths both accommodate APRV about equally well, unlinking branch lengths will be a better strategy when different data partitions do not maintain the same relative evolutionary rates across the tree. This unusual pattern of among-lineage rate variation has been termed "mosaic evolution" by Simon et al. (1996) where it was demonstrated to occur in 5' versus 3' partitions of 12S rRNA of primates, cicadas and Drosophila. A similar phenomenon focusing on heterogeneity in rate among lineages at individual nucleotide sites has been described by Tuffley and Steel (1998) and Lockhart and colleagues (e.g., Lockhart et al., 1996, 2000), who called it covarion-like evolution, and Philippe and colleagues (e.g., Lopez et al., 2002; Gribaldo et al., 2003), who called it heterotachy. All of these phenomena are forms of covarion-like evolution (Lockhart and Steel, 2005) modified from the early ideas of Fitch and colleagues (e.g., Fitch and Markowitz, 1970; Fitch, 1986; Miyamoto and Fitch, 1995).
| Examples from Published Partitioned Analyses |
|---|
|
|
|---|
Some studies using MrBayes version 3 for partitioned Bayesian analyses have explicitly mentioned accommodation of APRV generally or use of the prset ratepr = variable option specifically (e.g., Nylander et al., 2004; Bergsten, 2005; Danforth et al., 2005; Ishiguro et al., 2005; Miya et al., 2005). However, many partitioned Bayesian studies do not mention if and/or how among-partition rate variation was modeled, and some may illustrate the tree-lengthening effect of failure to accommodate variable partition rates. For example, Brandley et al. (2005) demonstrated the superiority of partitioned models for resolving phylogenetic relationships of scincid lizards; their most-partitioned analysis yielded significantly better log-likelihoods and resolved deeper nodes in the tree more effectively than an unpartitioned analysis. The study employed a large amount of protein-coding data partitioned by codon position, and the default MrBayes option of prset ratepr = fixed was used (M. Brandley, personal communication). Judging from the scale bars in the figures, the tree presented for their most-partitioned analysis (fig. 4 in Brandley et al. 2005) is approximately twice as long as the tree from the unpartitioned analysis (their fig. 3). The Brandley et al. (2005) study is unusual in that the authors presented trees from both partitioned and unpartitioned analyses, without which it would be difficult to determine if APRV had influenced the branch lengths. (Note that the conclusions of Brandley et al., 2005, about the importance of data partitioning would not have been affected.) A tree from a partitioned Bayesian analysis of Malagasy scincid lizards (Schmitz et al., 2005) appears surprisingly long for a study encompassing a group of closely related genera, with a total branch length of approximately 1.3 expected substitutions per site from the base of the tree to the average tip. No mention was made of the MrBayes ratepr = variable option or unlinking branch lengths.
Older Bayesian analyses using MrBayes version 2 (e.g., Leys et al., 2002; Castoe et al., 2004) are unlikely to have been influenced by the specific tree-lengthening problem discussed in this paper, because data partitioning was allowed only by way of the site-specific rates option. The MrBayes 3.0 prset ratepr = variable option appears to be very similar to the site-specific rates feature of MrBayes 2.0.
| Importance of Accurate Branch Length Estimation to Studies of Divergence Times |
|---|
|
|
|---|
Divergence time estimation is a frequent goal of molecular phylogenetic studies. Earlier studies produced biased estimates of time (e.g., Doolittle et al., 1996) by failing to account for among-site rate variation (Gogarten et al., 1996; Hasegawa and Fitch, 1996), by failing to correct for multiple hits (e.g., any dating study that uses evenly weighted parsimony branch lengths), or by failing to relax the assumption that different lineages evolve at different rates. For these reasons, it is now generally accepted that best-fitting models that accommodate these biases are necessary for molecular dating (Arbogast et al., 2002; Welch and Bromham, 2005). Yang and Yoder (2003) examined the effects of partitioning protein-coding genes by codon position and estimating dates from these multiple partitions. They found that date estimates varied among codon positions when dates were estimated separately for each but that a combined analysis using a realistic model produced the best results.
Our study has demonstrated that estimates of tree length (the product of rate and time) in analyses with significant APRV are dramatically affected by model assumptions (see Welch et al., 2005, for a related study involving evolutionary rate priors), raising the possibility that divergence time estimates may also be affected if an external rate calibration is used. Fortunately, the most commonly used methods of divergence time estimation are not likely to encounter the problems discussed above. Many dating studies use the Bayesian program MULTIDIVTIME (Thorne et al., 1998; Kishino et al., 2001; Thorne and Kishino, 2002), for which branch lengths are separately estimated for each data partition, eliminating the need for concern about APRV. Penalized likelihood analyses (e.g., R8S; Sanderson, 2002) begin with a user-provided consensus phylogram, but in our empirical examples the relative branch lengths did not usually vary across partitions (but see Fig. 5) so failure to accommodate APRV would not necessarily affect the performance of the algorithm. Tree length errors generated by failure to explicitly model APRV are more likely to influence studies that apply externally estimated "standard" molecular-clock rates (e.g., Brower, 1994) to branch lengths estimated under partitioned models as well as studies that measure evolutionary rates using fossil-calibrated trees in which there are few calibrations and calibrations are located toward the tips of the trees.
| How to Partition? |
|---|
|
|
|---|
The biggest problem with data partitioning is deciding how to partition. The decision is easier in protein coding genes where the triplet code imposes a codon structure. Yang and Yoder (2003) found that partitioning according to codon position yielded different estimates of molecular dates. Brandley et al. (2005) similarly found that the most important factor in their protein coding partitioning strategy was to separate the rapidly evolving third positions from the slower first and second positions.
Partitioning of ribosomal genes is less intuitive. Although some authors have partitioned rRNA into stems (helices) versus loops (unpaired regions) (e.g., Wheeler and Honeycutt, 1988; Vawter and Brown, 1993; Springer and Douzery, 1996), this does not make sense in terms of evolutionary rate because there is a continuum of variability in stems and loops with some stems and loops evolving quickly on average while others on average remain highly conserved (Peer et al., 1993; Simon et al., 1994; Hickson et al., 1996; Pagel and Meade, 2004). Even more problematic is the presence of rapidly evolving sites in otherwise slowly evolving stems and vice versa (Simon et al., 1996). Models that accommodate correlation among sites (base pairing in stems) (Kjer, 2004) and other components of secondary structure (Jow et al., 2002; Hudelot et al., 2003; Telford et al., 2005) may help address some of these difficulties. Perhaps the best approach to modeling rRNA is to use mixture models that do not require a priori specification of partitions to retrieve signals of pattern heterogeneity (Pagel and Meade, 2004). In mixture models, characters are viewed as having been generated by one of a number of distributions and each of these distributions contributes to the likelihood during an analysis. The parameters of each model distribution and the weights assigned to them are estimated from the data. In the end of the analysis each character can be assigned a probability of membership in each model. Mixture models have also been applied to protein coding genes (Pagel and Meade, 2004) and amino acid sequence evolution (Lartillot and Philippe, 2004), and they may offer an alternative means of avoiding the tree length issues encountered in partitioned Bayesian studies. Preliminary analyses of the Kikihia mtDNA data set (unpublished data) using the program BayesPhylogenies (Pagel and Meade, 2004) yielded tree length estimates comparable to those obtained in Bayesian analyses that accommodated APRV, using mixed models with just two or three "patterns" (or matrices).
| Appendix 1. Methods |
|---|
|
|
|---|
Effect of Accommodation of APRV on Bayesian Analyses: Empirical Data
The effect of accommodation of among-partition rate variation (APRV) on tree length and other Bayesian parameter estimates was examined using 2152 bp of mtDNA COI, COII, ATPase 6, and ATPase 8 from 32 New Zealand cicada taxa of the genus Kikihia (TreeBase submission SN3085-12924). The sequences have been deposited in GenBank (accession numbers: COI: EF051355 [GenBank] –EF051385 [GenBank] ; COII: EF051386 [GenBank] –EF051416 [GenBank] , A6A8: EF051417 [GenBank] –EF051446 [GenBank] ). Using MrBayes 3.0b4 (Huelsenbeck and Ronquist, 2001), single-and three-partition analyses were conducted with or without APRV accommodated by a partition rate multiplier (each partition assigned a separate rate multiplier, mn) or by unlinking branch lengths (each partitioned assigned a separate tree length, TLn). These were designated analyses A to D. In the three-partition analyses each codon-position was modeled separately (first position sites in partition 1, etc.). An additional three-partition analyses was then conducted (analysis E) in which "random" partitions were used, each containing an equal number of sites from each codon position. Each data partition was separately modeled (parameters unlinked) with a general time-reversible (GTR: Yang, 1994a) sequence evolution model with invariant sites (I; Hasegawa et al., 1985) and with among-site rate variation (ASRV) modeled by a discrete gamma approximation; (
; Yang, 1994b) with eight rate categories. All analyses used a branch length prior of unconstrained:exponential(75) and a shape parameter (
) prior of exp(1). All other parameters were set to the MrBayes default values. For the unlinked branch length analysis (analysis C), the consensus tree length (TLALL, which is not found in the MrBayes output) was calculated by averaging the partition-specific tree lengths (TL1,... TLn), weighted by the number of sites in each partition. Each MrBayes analysis used four chains, one "cold" and three "heated", run for 3 x 106 to 8 x 106 generations (depending on number of parameters in the model) and sampled every 1000 generations; a burnin of approximately 25% was used for most analyses. Stationarity was checked by monitoring post-burnin stability of mean log-likelihood and individual parameter estimates with Tracer v1.3 (Rambaut and Drummond, 2003) and by ensuring that duplicate analyses converged on similar outcomes.
Effect of Accommodation of APRV on Bayesian Analyses: Simulated Data
We also examined the relationship between tree length and degree of among-partition rate variation using three-partition data sets of 2400 bp each simulated on a known tree (Fig. 1) under three different levels of APRV (relative partition rates 1:1:1, 3:1:9, and 5:1:25 for treatments A, B, and C, respectively). Data were simulated using the program Seq-Gen v1.3.2 (Rambaut and Grassly, 1997). Twenty datasets were simulated for each of the three APRV treatments for a total of 60 simulated data sets (representative data sets are summarized in Table 3). Within each treatment, the twenty simulated data sets were each analyzed under three partitioned models differing only in the method of accommodation of APRV (no-APRV, rate-multiplier m, and unlinked branches), and the mean log-likelihood and model parameter estimates from the 20 analyses were averaged for each of these three models. The HKY+
model was used in data simulation (and Bayesian phylogenetic analysis) to limit the number of parameters involved, and a small number of taxa (6) were used to accelerate the analyses. The following model parameters were chosen for the simulations:
= 2.0,
= 1.0, four rate categories, equal base frequencies (
), 800 bp per partition (2400 bp total), overall TL = 0.64. Partition-specific per-site substitution rates (TL1, TL2, TL3) were simulated as follows: Treatment A (1:1:1) – 0.64, 0.64, 0.64; treatment B (3:1:9) – 0.44, 0.15, 1.32; treatment C (5:1:25)—0.31, 0.06, 1.54. Base frequency (
) estimates of different models did not vary and were omitted from the results for brevity.
Phylogenetic analyses were conducted with MrBayes v3.1 (Ronquist and Huelsenbeck, 2003) to allow use of the chain-convergence diagnostics offered by this version of the program. The parameters
and
were estimated separately for each partition (but the base frequencies were "linked" to reduce the number of parameters involved in the test) and the prior means for
(exp[1]) and
(beta[1,1]) were set to match the mean values used in the data simulations. A branch length prior of exp(5.247) was used in all analyses. All other parameters were set to the MrBayes default values. MCMC was run for 1.5 million generations in each analysis and 0.5 million generations were discarded as Burn-in. Stationarity was confirmed for all analyses by monitoring the stability of post-burn-in mean log-likelihood and parameter estimates and (for a limited number of analyses of each treatment) ensuring that the uncorrected potential scale reduction factors (PSRF; Gelman and Rubin, 1992) remained close to 1.0 in analyses with the MrBayes "nruns" parameter set to 2; values of PSRF usually did not exceed 1.01. (Yang, 2005-80, suggests that values less than 1.1 or 1.2 indicate convergence "in real data problems".) MrBayes analyses employing rate-multipliers do not return partition-specific tree lengths; instead, the product of the estimated rate mutiplier (mn) for each partition (not shown) and TL is given here to facilitate comparison between results from different models. Similarly, TLavg (the average of the three partition-specific per-site substitution rates obtained from an unlinked branch length analysis) is included here for comparison with the TL values returned by the rate multiplier analyses. Base frequency (
) estimates of different models did not vary and were omitted from the results presented here for brevity.
| Acknowledgments |
|---|
|
|
|---|
We thank two anonymous reviewers, Allan Baker, and Rod Page for helpful comments on the manuscript. Paul O. Lewis contributed guidance on likelihood-based models and Bayesian analysis. Mark Pagel provided helpful discussions on mixture models and made the latest version of his program available to us. Matt Brandley kindly read a draft of the manuscript and provided useful information. Computational resources for the phylogenetic analyses were provided by the Bioinformatics Facility of the Biotechnology/Bioservices Center, University of Connecticut, Storrs, CT. This material is based upon work partially supported by the National Science Foundation under grant No. DEB-0422386 to CS. TB was supported by funding from the Foundation for Research, Science and Technology (contract number C09X0501).
| References |
|---|
|
|
|---|
-
Arbogast B. S., Edwards S. V., Wakeley J., Beerli P., Slowinski J. B. Estimating divergence times from molecular data on phylogenetic and population genetic timescales. Annu. Rev. Ecol. Syst. (2002) 33:707–740.[CrossRef][Web of Science]
Barker F. K., Lutzoni F. M. The utility of the incongruence length difference test. Syst. Biol. (2002) 51:625–637.
Bergsten J. A review of long-branch attraction. Cladistics (2005) 21:163–193.[CrossRef][Web of Science]
Brandley M. C., Schmitz A., Reeder T. W. Partitioned Bayesian analyses, partition choice, and the phylogenetic relationships of scincid lizards. Syst. Biol. (2005) 54:373–390.
Brower A. V. Z. Rapid morphological radiation and convergence among races of the butterfly Heliconius erato inferred from patterns of mitochondrial DNA evolution. Proc. Natl. Acad. Sci. USA (1994) 91:6491–6495.
Buckley T. R., Arensburger P., Simon C., Chambers G. K. Combined data, Bayesian phylogenetics, and the origin of the New Zealand Cicada genera. Syst. Biol. (2002) 51:4–18.
Buckley T. R., Cunningham C. W. The effects of nucleotide substitution model assumptions on estimates of nonparametric bootstrap support. Mol. Biol. Evol. (2002) 19:394–405.
Buckley T. R., Simon C., Chambers G. K. Exploring among-site rate variation models in a maximum likelihood framework using empirical data: Effects of model assumptions on estimates of topology, branch lengths, and bootstrap support. Syst. Biol. (2001) 50:67–86.
Bull J., Huelsenbeck J. P., Cunningham C. W., Swofford D. L., Waddell P. J. Partitioning and combining data in phylogenetic analysis. Syst. Biol. (1993) 42:384–397.
Castoe T. A., Doan T. M., Parkinson C. L. Data partitions and complex models in Bayesian analysis: The phylogeny of Gymnophthalmid lizards. Syst. Biol. (2004) 53:448–469.
Cunningham C. W. Can three incongruence tests predict when data should be combined? Mol. Biol. Evol. (1997) 14:733–740.[Abstract]
Cunningham C. W., Zhu H., Hillis D. M. Best-fit maximum likelihood models for phylogenetic inference: Empirical tests with known phylogenies. Evolution (1998) 52:978–987.[CrossRef][Web of Science]
Danforth B. N., Lin C. P., Fang J. How do insect nuclear ribosomal genes compare to protein-coding genes in phylogenetic utility and nucleotide substitution patterns? Syst. Entomol. (2005) 30:549–562.
Doolittle R. F., Feng D. F., Tsang S., Cho G., Little E. Determining divergence times of the major kingdoms of living organisms with a protein clock. Science (1996) 271:470–477.[Abstract]
Dowton M., Austin A. Increased congruence does not necessarily indicate increased phylogenetic accuracy—the behavior of the incongruence length difference test in mixed-model analysis. Syst. Biol. (2002) 51:19–31.
Fitch W. M. The estimate of total nucleotide substitutions from pairwise differences is biased. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. (1986) 312:317–324.
Fitch W. M., Markowitz E. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. (1970) 4:579–593.[CrossRef][Web of Science][Medline]
Gelman A., Rubin D. B. Inference from iterative simulation using multiple sequences (with discussion). Stat. Sci. (1992) 7:457–511.[CrossRef]
Gogarten J. P., Olendzenski L., Hilario E., Simon C., Holsinger K. E. Dating the cenancestor of organisms. Science (1996) 274:1750–1751.[CrossRef][Web of Science][Medline]
Gribaldo S., Casane D., Lopez P., Philippe H. Functional divergence prediction from evolutionary analysis: A case study of vertebrate hemoglobin. Mol. Biol. Evol. (2003) 20:1754–1759.
Hasegawa M., Fitch W. M. Dating the cenancester of organisms. Science (1996) 274:1750.[CrossRef][Web of Science][Medline]
Hasegawa M., Kishino H., Yano T. Dating the human-ape split by a molecular clock by mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.[CrossRef][Web of Science][Medline]
Hickson R., Simon C., Cooper A., Spicer G., Sullivan J., Penny D. Conserved sequence motifs, alignment, and secondary structure for the third domain of animal 12S rRNA. Mol. Biol. Evol. (1996) 13:150–169.[Abstract]
Hudelot C., Gowri-Shankar V., Jow H., Rattray M., Higgs P. G. RNA-based phylogenetic methods: Application to mammalian mitochondrial RNA sequences. Mol. Phylogenet. Evol. (2003) 28:241–252.[CrossRef][Web of Science][Medline]
Huelsenbeck J. P., Bull J. J. A likelihood ratio test to detect conflicting phylogenetic signal. Syst. Biol. (1996) 45:92–98.
Huelsenbeck J. P., Rannala B., Yang Z. Statistical tests of host-parasite cospeciation. Evolution (1997) 51:410–419.[CrossRef][Web of Science]
Huelsenbeck J. P., Ronquist F. MrBayes: Bayesian inference of phylogeny. Bioinformatics (2001) 17:754–755.
Huelsenbeck J. P., Swofford D. L., Cunningham C. W., Bull J. J., Waddell P. J. Is character weighting a panacea for the problem of data heterogeneity in phylogenetic analysis? Syst. Biol. (1994) 43:288–291.
Ishiguro N. B., Miya M., Inoue J. G., Nishida M. Sundasalanx (Sundasalangidae) is a progenetic clupeiform, not a closely-related group of salangids (Osmeriformes): Mitogenomic evidence. J. Fish Biol. (2005) 67:561–569.[CrossRef][Web of Science]
Jow H., Hudelot C., Rattray M., Higgs P. G. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol. (2002) 19:1591–1601.
Kass R. E., Raftery A. E. Bayes factors. J. Am. Stat. Assoc. (1995) 90:773–795.[CrossRef][Web of Science]
Kishino H., Thorne J. L., Bruno W. J. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol. Biol. Evol. (2001) 18:352–361.
Kjer K. Aligned 18S and insect phylogeny. Syst. Biol. (2004) 53:506–514.
Kuhner M. J., Felsenstein J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. (1994) 11:459–468.[Abstract]
Lartillot N., Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. (2004) 21:1095–1109.
Lemmon A. R., Moriarty E. C. The importance of proper model assumption in Bayesian phylogenetics. Syst. Biol. (2004) 53:265–277.
Leys R., Cooper S. J. B., Schwarz M. P. Molecular phylogeny and historical biogeography of the large carpenter bees, genus Xylocopa (Hymenoptera: Apidae). Biol. J. Linn. Soc. (2002) 77:249–266.[CrossRef][Web of Science]
Lockhart P. J., Huson D., Maier U., Fraunholz M. J., Peer Y. V. D., Barbrook A. C., Howe C. J., Steel M. A. How molecules evolve in Eubacteria. Mol. Biol. Evol. (2000) 17:835–838.
Lockhart P. J., Larkum A. W. D., Steel M. A., Waddell P. J., Penny D. Evolution of chlorophyll and bacteriochlorophyll: The problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA (1996) 93:1930–1934.
Lockhart P. J., Steel M. A. A tale of two processes. Syst. Biol. (2005) 54:948–951.
Lopez P., Casane D., Philippe H. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. (2002) 19:1–7.
Miya M., Satoh T. P., Nishida M. The phylogenetic position of toadfishes (order Batrachoidiformes) in the higher ray-finned fish as inferred from partitioned Bayesian analysis of 102 whole mitochondrial genome sequences. Biol. J. Linn. Soc. (2005) 85:289–306.[CrossRef][Web of Science]
Miyamoto M., Fitch W. M. Testing species phylogenies and phylogenetic methods with congruence. Syst. Biol. (1995) 44:64–76.
Nylander J. A. A., Ronquist F., Huelsenbeck J. P., Nieves-Aldrey J. L. Bayesian phylogenetic analysis of combined data. Syst. Biol. (2004) 53:47–67.
Pagel M., Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. (2004) 53:571–581.
Peer Y. v. d., Neefs J. M., Rijk P. d., Wachter R. d. Reconstructing evolution from eukaryotic small-ribosomal-subunit RNA sequences: Calibration of the molecular clock. J. Mol. Evol. (1993) 37:221–232.[CrossRef][Web of Science][Medline]
Philippe H., Zhou T., Brinkmann H., Rodrique N., Delsuc F. Heterotachy and long-branch attraction in phylogenetics. BMC Evol. Biol. (2005) 5:50.[CrossRef][Medline]
Rambaut A., Drummond A. J. Tracer v1.3 (2003) http://evolve.zoo.ox.ac.uk/Available from.
Rambaut A., Grassly N. C. Seq-Gen: An application for the Monte-Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. (1997) 13:235–238.
Rannala B. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst. Biol. (2002) 51:754–760.
Ronquist F., Huelsenbeck J. P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.
Sanderson M. J. Estimating absolute rates of molecular evolution and divergence times: A penalized likelihood approach. Mol. Biol. Evol. (2002) 19:101–109.
Schmitz A., Brandley M. C., Maufield P., Vences M., Glaw F., Nussbaum R. A., Reeder T. W. Opening the black box: Phylogenetics and morphological evolution of the Malagasy fossorial lizards of the subfamily "Scincidae". Mol. Phylogenet. Evol. (2005) 34:118–133.[CrossRef][Web of Science][Medline]
Simon C., Frati F., Beckenbach A., Crespi B., Liu H., Flook P. Evolution, weighting, and phylogenetic utility of mitochondrial gene sequences and a compilation of conserved PCR primers. Ann. Entomol. Soc. Am. (1994) 87:651–701.[Web of Science]
Simon C., Nigro L., Sullivan J., Holsinger K., Martin A., Grapputo A., Franke A., McIntosh C. Large differences in substitutional pattern and evolutionary rate of 12S ribosomal RNA genes. Mol. Biol. Evol. (1996) 13:923–932.[Abstract]
Springer M. S., Douzery E. Secondary structure and patterns of evolution among mammalian mitochondrial 12S rRNA molecules. J. Mol. Evol. (1996) 43:357–373.[Web of Science][Medline]
Sullivan J., Holsinger K. E., Simon C. Among-site rate variation and phylogenetic analysis of 12S rRNA in Sigmodontine rodents. Mol. Biol. Evol. (1995) 12:988–1001.[Abstract]
Telford M. J., Wise M. J., Gowri-Shankar V. Consideration of RNA secondary structure significantly improves likelihood-based estimates of phylogeny: Examples from the Bilateria. Mol. Biol. Evol. (2005) 22:1129–1136.
Thorne J. L., Kishino H. Divergence time and evolutionary rate estimation with multilocus data. Syst. Biol. (2002) 51:689–702.
Thorne J. L., Kishino H., Painter I. S. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. (1998) 15:1647–1657.[Abstract]
Tuffley C., Steel M. A. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. (1998) 147:63–91.[CrossRef][Web of Science][Medline]
Vawter L., Brown W. M. Rates and patterns of base change in the SSU ribosomal RNA gene. Genetics (1993) 134:597–608.[Abstract]
Welch J. J., Bromham L. Molecular dating when rates vary. Trends Ecol. Evol. (2005) 20:320–327.[CrossRef][Medline]
Welch J. J., Fontanillas E., Bromham L. Molecular dates for the "Cambrian Explosion": The influence of prior assumptions. Syst. Biol. (2005) 54:672–678.
Wheeler W. C., Honeycutt R. L. Paired sequence difference in ribosomal RNAs: Evolutionary and phylogenetic implications. Mol. Biol. Evol. (1988) 5:90–96.[Abstract]
Yang Z. Estimating the pattern of nucleotide substitution. J. Mol. Evol. (1994a) 39:105–111.[Web of Science][Medline]
Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. (1994b) 39:306–314.[CrossRef][Web of Science][Medline]
Yang Z. Bayesian inference in molecular phylogenetics. In: Mathematics of evolution and phylogeny—Gascuel O., ed. (2005) Oxford: Oxford University Press. 63–90.
Yang Z. Maximum likelihood models for combined analysis of multiple sequence data. J. Mol. Evol. (1996) 42:587–596.[CrossRef][Web of Science][Medline]
Yang Z., Yoder A. D. Comparison of likelihood and Bayesian methods for estimating divergence times using multiple gene loci and calibration points, with application to a radiation of cute-looking mouse lemur species. Syst. Biol. (2003) 52:705–716.
This article has been cited by other articles:
![]() |
D. C. Marshall Cryptic Failure of Partitioned Bayesian Phylogenetic Analyses: Lost in the Land of Long Trees Syst Biol, November 17, 2009; (2009) syp080v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. E. Roberts, E. J. Sargis, and L. E. Olson Networks, Trees, and Treeshrews: Assessing Support and Identifying Conflict with Multiple Loci and a Problematic Root Syst Biol, June 16, 2009; (2009) syp025v3. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Q. Spinks and H. B. Shaffer Conflicting Mitochondrial and Nuclear Phylogenies for the Widely Disjunct Emys (Testudines: Emydidae) Species Complex, and What They Tell Us about Biogeography and Hybridization Syst Biol, May 28, 2009; (2009) syp005v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. C. Steiner, H. Rompler, L. M. Boettger, T. Schoneberg, and H. E. Hoekstra The Genetic Basis of Phenotypic Convergence in Beach Mice: Similar Pigment Patterns but Different Genes Mol. Biol. Evol., January 1, 2009; 26(1): 35 - 45. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. R. Linnen and B. D. Farrell Comparison of Methods for Species-Tree Inference in the Sawfly Genus Neodiprion (Hymenoptera: Diprionidae) Syst Biol, December 1, 2008; 57(6): 876 - 890. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Dohrmann, D. Janussen, J. Reitner, A. G. Collins, and G. Worheide Phylogeny and Evolution of Glass Sponges (Porifera, Hexactinellida) Syst Biol, June 1, 2008; 57(3): 388 - 405. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. A. Clarke and K. M. Middleton Mosaicism, Modules, and the Evolution of Birds: Results from a Bayesian Approach to the Study of Morphological Evolution Using Discrete Character Data Syst Biol, April 1, 2008; 57(2): 185 - 201. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. A. McGuire, C. C. Witt, D. L. Altshuler, and J. V. Remsen Phylogenetic Systematics and Biogeography of Hummingbirds: Bayesian and Maximum Likelihood Analyses of Partitioned Data and Selection of an Appropriate Partitioning Strategy Syst Biol, October 1, 2007; 56(5): 837 - 856. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. M. Brown and A. R. Lemmon The Importance of Data Partitioning and the Utility of Bayes Factors in Bayesian Phylogenetics Syst Biol, August 1, 2007; 56(4): 643 - 655. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






