Skip Navigation

Systematic Biology 2006 55(2):195-207; doi:10.1080/10635150500433722
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (31)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lartillot, N.
Right arrow Articles by Philippe, H.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Lartillot, N.
Right arrow Articles by Philippe, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2006 Society of Systematic Biologists

Computing Bayes Factors Using Thermodynamic Integration

Edited by Paul Lewis: Associate Editor

Nicolas Lartillot1 and Hervé Philippe2

1 Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier UMR 5506, CNRS-Université de Montpellier 2, 161, rue Ada, 34392 Montpellier Cedex 5, France; E-mail: nicolas.lartillot{at}lirmm.fr
2 Canadian Institute for Advanced Research, Université de Montréal Département de Biochimie Montréal, Québec, Canada


    Abstract
 Top
 Abstract
 Data and Models
 Marginal Likelihood Estimation
 Comparing Importance Sampling...
 Application: Models of Amino...
 Discussion
 References
 
In the Bayesian paradigm, a common method for comparing two models is to compute the Bayes factor, defined as the ratio of their respective marginal likelihoods. In recent phylogenetic works, the numerical evaluation of marginal likelihoods has often been performed using the harmonic mean estimation procedure. In the present article, we propose to employ another method, based on an analogy with statistical physics, called thermodynamic integration. We describe the method, propose an implementation, and show on two analytical examples that this numerical method yields reliable estimates. In contrast, the harmonic mean estimator leads to a strong overestimation of the marginal likelihood, which is all the more pronounced as the model is higher dimensional. As a result, the harmonic mean estimator systematically favors more parameter-rich models, an artefact that might explain some recent puzzling observations, based on harmonic mean estimates, suggesting that Bayes factors tend to overscore complex models. Finally, we apply our method to the comparison of several alternative models of amino-acid replacement. We confirm our previous observations, indicating that modeling pattern heterogeneity across sites tends to yield better models than standard empirical matrices.

Keywords: Bayes factor; harmonic mean; mixture model; path sampling; phylogeny; thermodynamic integration

Received March 4, 2005; Revised May 19, 2005; Accepted September 16, 2005


Bayesian methods have become popular in molecular phylogenetics over the recent years. The simple and intuitive interpretation of the concept of probabilities underlying the Bayesian paradigm makes it an appealing framework of scientific inference in general (Jaynes 2003). On the other hand, the Bayesian practice also entails mathematical difficulties, which have prevented its use in most practical fields until recently. Over the last 10 years, the situation has changed, mainly due to the impressive advances in computational power. In addition, general numerical methods based on Markov chains Monte Carlo (MCMC) have been developed, allowing one to conduct Bayesian inferences under a large category of probabilistic models, with few constraints on dimensionality or analytical integrability (Gelman et al., 2004; Holder and Lewis, 2003; Huelsenbeck et al., 2002).

However, this new freedom in model exploration has to be complemented by efficient and reliable methods of model evaluation and selection. More fundamentally, it raises the question of whether devising more complex evolutionary models is indeed relevant in the first place, given the problems that such a project might imply (Rannala, 2002). At first sight, current phylogenetic models offer a good compromise between complexity and tractability. They account for unequal rates of substitution among amino acids through general time-reversible matrices determined empirically or directly inferred from the data (Jones et al., 1992; Whelan and Goldman, 2001). They also allow different positions along the sequence to evolve at different speeds (Yang, 1993, 1994). Both aspects seem to have a significant impact on the quality of the phylogenetic estimates (Brinkmann et al., 2005; Yang, 1996). Yet, this may not be sufficient, as evidenced by all the inconsistencies still observed in many phylogenetic analyses (Gaut and Lewis, 1995; Philippe et al., 2005; Stefanovic, 2004; Sullivan and Swofford, 1997). In an otherwise coherent statistical framework, such as maximum likelihood or the Bayesian method, inconsistencies are a clear indication of model misspecifications, suggesting that some simplifying assumptions common to most current models (e.g., absence of gene conversions or lateral gene transfers, homogeneity of the equilibrium frequencies across sites, stationarity of the substitution process across lineages) may need to be relaxed as well. Hence, in the aim of obtaining more reliable phylogenetic inference, a wider diversity of models than those currently considered has still to be investigated, calling for good methods to perform both parameter estimation and reliable model choice.

Bayesian inference is in general tantamount to exploring the posterior probability distribution over the parameters of interest. Given a model M, with parameter vector {theta} isin {Theta} (specifying, for instance, the tree topology and branch lengths), and applied on a dataset D, the posterior probability distribution is given by Bayes' theorem:


Formula 1

(1)
where p({theta} | M) is the prior distribution, p(D | {theta}, M) the likelihood function, and


Formula 2

(2)
is the normalization constant, also called the predictive probability, or marginal likelihood.

As for model fit, the normalization constant, p(D | M), is of primary importance. As a function of M, it can literaly read as the likelihood of model M, given the data D. Accordingly, among several models, one is led to choose the one of greatest marginal likelihood. When two models M0 and M1 are being compared, one usually defines the Bayes factor in favor of M1 over M0 as the ratio of their respective marginal likelihoods (Jeffreys, 1935; Kass and Raftery, 1995):


Formula 3

(3)
Values of the Bayes factor greater (smaller) than 1 will be considered as evidence in favor of M1 (M0). Other approaches for evaluating model fit in a Bayesian context have been proposed, such as cross-validation (Stone, 1974), posterior predictive approaches (Gelman et al., 1996; Meng, 1994; Rubin, 1984) applied in phylogenetic model comparison (Bollback, 2002), as well as fractional (O'Hagan, 1995), posterior (Aitkin, 1991), or intrinsic (Berger and Pericchi, 1996) Bayes factors. But in the following, we will focus exclusively on the traditional Bayes factor, which is more intuitive in a model-likelihood interpretation perspective.

In practice, posterior expectations can be efficiently estimated by sampling from the posterior distribution, using, for instance, MCMC methods such as the Metropolis-Hastings or the Gibbs sampling algorithms. These methods are now applied extensively in molecular phylogenetics (Huelsenbeck and Ronquist, 2001; Larget and Simon, 1999; Lartillot and Philippe, 2004; Pagel and Meade, 2004; Suchard, 2001). In contrast, the numerical evaluation of the marginal likelihood, and thereby of the Bayes factor, is anything but easy, in particular for high dimensional models, and for large datasets (Han and Carlin, 2000; Kass and Raftery, 1995). Note, in this respect, that the MCMC algorithms used for posterior sampling only involve the ratio of two posterior probabilities (i.e., of the current and the newly proposed parameter value), in which the normalization constant p(D | M) cancels out:


Formula 4

(4)
This implies that these algorithms, however efficient at sampling from the posterior, do not allow one to estimate p(D | M) directly.

Among the methods available for evaluating Bayes' factors, many are valid only under very specific conditions. For instance, the Dickey-Savage ratio (Verdinelli and Wasserman, 1995), applied in phylogenetics (Suchard, 2001), assumes nested models. The Laplace estimator (Kass and Raftery, 1995), or the Bayesian Information Criterium (Schwartz, 1978), applied in phylogenetics (Minin et al., 2003; Waddell et al., 2002) are large sample approximations around the maximum likelihood, which cannot always be easily evaluated for complex models. The Laplace estimator (Kass and Raftery, 1995) relies on a normal approximation around the maximum likelihood (ML), which may not be valid for parameter-rich models. The reversible-jump approach (Green, 1995), where a MCMC is devised to jump between models according to the Metropolis-Hastings rule, can in principle be made as general as desired. Yet, in practice, the Metropolis-Hastings moves between models have to be accepted at a sufficient rate for the method to be practical. This requirement can be met quite easily as long as the models being compared are formulated along similar parameterizations; for instance, alternative substitution matrices (Huelsenbeck et al., 2004), or different number of classes for a mixture model (Lartillot and Philippe, 2004). In contrast, the reversible-jump method is not easily applicable when comparing models based on an entirely different parametric rationale.

We are thus left with only a few methods of potentially general applicability, among which (1) the importance sampling estimators, and particularly the harmonic mean estimator (HME) (Newton and Raftery, 1994), and (2) thermodynamic integration, or path sampling (Gelman, 1998; Ogata, 1989). The HME is by far the simplest method, only requiring a sample from the posterior distribution. It has been applied repeatedly, in particular in phylogenetic model comparison (Irestedt et al., 2004; Nylander et al., 2004; Pagel and Meade, 2004). Because its variance may be infinite, a modified, stabilized version has also been proposed (Newton and Raftery, 1994), also used in phylogenetics (Suchard et al., 2003). Thermodynamic integration, on the other hand, is based on a completely different rationale, relies on a more elaborate and computationally more intensive MCMC sampling scheme, but is statistically more well-behaved (Gelman, 1998). Its name stems from an analogy with physics, where the marginal likelihood is equivalent to the so-called partition function and its logarithm to the free energy. In fact, physicists have had to evaluate probabilities formulated in terms of high-dimensional integrals for a long time now (Neal, 2000). Therefore, transposing their well-tried methods into other numerical problems could be a promising approach.

In this work, we have implemented the HME and the method of thermodynamic integration. We have applied them to the comparison of models of sequence evolution. We show by several means that, whereas thermodynamic integration yields reliable quantitative estimates of Bayes' factors, the HME is unreliable and can even lead to qualitative reversions of the comparisons being conducted. Altogether, considering that some Bayes' factor evaluations performed in a phylogenetic context thus far have relied on the harmonic estimator, we advocate that more caution should be applied, and that thermodynamic integration, or other methods not investigated here, should be used instead. Finally, using thermodynamic integration, we compare several models of amino-acid replacement, among which are site-heterogeneous models that we have proposed previously (Lartillot and Philippe, 2004).


    Data and Models
 Top
 Abstract
 Data and Models
 Marginal Likelihood Estimation
 Comparing Importance Sampling...
 Application: Models of Amino...
 Discussion
 References
 
Five datasets were considered in this study. The following nomenclature specifies, for each dataset, the type of protein, the number of taxa (P), and the length of the alignment (N):

  • PGK30-276: sequences of phosphoglycerate kinase of 30 eubacterial species
  • EF30-627: sequences of elongation factor 2 from 30 eukaryotes
  • POL39-888: RNA polymerase Rpb2 of 39 eubacteria
  • DLIG40-430: DNA ligase of 40 eubacteria
  • UVR30-719: DNA excision nuclease subunit A of 30 eubacteria
For each dataset, the amino-acid sequences were retrieved from the databases and aligned using ClustalW (Thompson et al., 1994). The alignments were hand-corrected using the MUST package (Philippe, 1993), and regions ambiguously aligned were removed with the help of the GBlocks program (Castresana, 2000).

We assume a uniform prior over topologies, and an exponential distribution on branch lengths, with mean determined by a hyperparameter {lambda}. Rates across sites can be either uniform (UNI model) or distributed according to a Invariant + Gamma (I+{Gamma}) distribution (RAS), in which case both the {alpha} parameter and the proportion of invariant sites (p0) are considered as free parameters. We propose three alternative sets of priors on hyperparameters:

  • P1: an exponential prior of mean 1 on {lambda} and on {alpha} (default prior)
  • P2: an exponential prior of mean 1 on {lambda} and on 1/{alpha}
  • P3: a fixed value for {lambda} ({lambda} = 10), and a flat prior on {alpha}, with the restriction that {alpha} < 100, for the prior to remain proper.
In all cases, we assume a uniform prior on p0.

For the amino-acid replacement model, we consider five different cases:

  • WAG: the WAG empirical matrix (Whelan and Goldman, 2001). Stationary probabilities (equilibrium frequencies) will either be set equal to the values reported in the original article (WAG model), or considered as free parameters, with a flat Dirichlet prior (WAG+F model).
  • Poisson: a Poisson process, which is characterized by its stationary probability vector. We use the same combination of stationary probabilities as for WAG (i.e., Poisson, or Poisson+F).
  • GTR: the most general time-reversible matrix, which is implemented as described previously (Lartillot and Philippe, 2004).
  • MAX: each site has its own amino-acid replacement matrix, which is a Poisson process, whose profile, defined by the 20 equilibrium frequencies, is a random variable distributed according to a flat Dirichlet.
  • CAT: the distribution of amino-acid replacement matrices across sites is modeled by a mixture of a free number of Poisson processes. Each component is defined by a stationary probability vector. The prior is specified by a Dirichlet process (Lartillot and Philippe, 2004).

General MCMC Settings
The methods and implementation for MCMC sampling under these models have been described previously (Lartillot and Philippe, 2004). Briefly, the different components of the parameter vector (topology, branch lengths, site-specific rates, stationary probability vectors, hyperparameters) are updated separately, according to a sequence of calls to all available update mechanisms. One such sequence defines a cycle. The number of cycles required for a given chain to reach its stationary equilibrium (burn-in), as well as the total number of cycles and the saving frequency, are first determined empirically. In a second step, the effective size of the sample is determined a posteriori by a time-series variance estimation method based on the empirical autocovariances of the log-likelihood time series (Geyer, 1992). We used a Tukey-Janning lag window (Raftery and Lewis, 1992), with a cutoff at K/4, where K is the total number of saved points. Given K and the decorrelation time {tau}, the effective size of the sample is then estimated as Keff = K/{tau}.

For each dataset, a first MCMC run under the WAG+F, I+{Gamma} model was conducted, and the consensus of 1,000 trees sampled from the posterior was computed. This consensus was then used for any analysis conducted under a fixed topology.

All source codes, data files, and trees are available from http://systematicbiology.org. Data matrices can also be downloaded from TreeBase (http://www.treebase.org, accession numbers S1388, M2476–M2480).


    Marginal Likelihood Estimation
 Top
 Abstract
 Data and Models
 Marginal Likelihood Estimation
 Comparing Importance Sampling...
 Application: Models of Amino...
 Discussion
 References
 
Importance Sampling Estimators
Given an unnormalized density g({theta}), an unbiased estimate of p(D | M) is given by the importance sampling formula


Formula 5

(5)
where Eg[c] is the expectation over g (Kass and Raftery, 1995). Using Monte Carlo procedures, a sample ({theta}k)k = 1..K can be drawn from g and used to approximate the expectations Eg[c]:


Formula 6

(6)

The simplest application of this method is to use the prior as the importance sampling distribution (g({theta}) = p({theta} | M)), in which case Eqs. (5) and (6) lead to the prior arithmetic mean estimator (AME):


Formula 7

(7)


Formula 8

(8)
However, a well-known problem with this estimator is that the high-likelihood region can be very small. Therefore, unless K is very large, the sample drawn from the prior will contain virtually no points from the high-likelihood region, resulting in a very poor estimate of p(D | M).

An alternative, proposed by Newton and Raftery (1994), is to draw from the posterior, rather than from the prior (g({theta}) = p(D | {theta}, M)). Intuitively, this should have the advantage of enriching the sample in points from the high-likelihood region. This results in the posterior harmonic mean estimator (HME):


Formula 9

(9)


Formula 10

(10)
The HME converges almost surely to the inverse of the marginal likelihood. However, in many practical situations, its variance is infinite. To circumvent this problem, Newton and Raftery (1994) proposed a third importance sampling scheme, called the stabilized harmonic mean estimator (SHME), based on a mixture of the prior and the posterior: g({theta}) = {delta} p({theta} | M) + (1 – {delta}) p({theta} | M). Typically, {delta} is chosen equal to 0.1.

General Principles of Thermodynamic Integration
This method, also called path sampling, is explained in greater details elsewhere (Gelman, 1998; Neal, 2000). Here, we give a slightly less formal introduction to its principles and show how it can be applied to phylogenetic problems.

Let us suppose that we have two unnormalized densities, q0({theta}) and q1({theta}), defined on the same parameter space {Theta}. The corresponding true probability densities are denoted by


Formula 11

(11)
where


Formula 12

(12)
are the normalization constants. Typically, in a Bayesian context, qi({theta}) = p(D | {theta}, Mi)p({theta} | Mi), Zi = p(D | Mi), and thus, pi({theta}) = p({theta} | D, Mi).

We wish to perform a numerical evaluation of the log-ratio


Formula 13

(13)


Formula 14

(14)
To do this, we define a continuous and differentiable path (qβ)0 ≤ β ≤ 1 in the space of unnormalized densities, joining q0 and q1. By extension, for any β, 0 ≤ β ≤ 1, pβ and Zβ are defined as


Formula 15

(15)


Formula 16

(16)
When β tends to 0 (resp. 1), pβ converges pointwise to p0 (resp. p1), and Zβ to Z0 (resp. Z1).

Taking the derivative of ln Zβ with respect to β:


Formula 17

(17)


Formula 18

(18)


Formula 19

(19)


Formula 20

(20)


Formula 21

(21)


Formula 22

(22)
where Eβ[c] stands for the expectation with respect to pβ. Defining the potential


Formula 23

(23)
one has thus the first moment identity:


Formula 24

(24)
Integrating over [0,1] yields the log-ratio one is looking for:


Formula 25

(25)


Formula 26

(26)


Formula 27

(27)

The key idea of thermodynamic integration is that, for any value of β between 0 and 1, one can run a Markov chain Monte Carlo in which qβ is used as the unnormalized density in the Metropolis-Hastings ratio. By definition, this yields a sample of parameter values drawn from the probability distribution pβ. Expectations over pβ can then be estimated as averages over this sample, which in particular allows one to evaluate Eβ[U]. This computation can be done for a series of values of β regularly spaced between 0 and 1, which implies running a Markov chain for each value of β (Fig. 1a, b). These sample expectations are finally used to approximate the integral over [0,1] (Eq. (25)) using Simpson's triangulation method (Fig. 1c).


Figure 1
View larger version (75K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 Rationale of the thermodynamic integration method. a, A series of independent chains are run under different values of β, and for each of them, the mean posterior expectation of the potential U = {partial} ln qβ/{partial} β is computed (horizontal lines). b, These mean posterior expectations are plotted against β. c, The integral of the curve is estimated by the Simpson procedure. d, Illustration of the quasistatic version, in which β moves continuously from 0 to 1 during MCMC (see text for details).

 
Specifically, assuming a discretization step of {Delta} β = 1/C, with C an integer (e.g., C = 10), for each d = 0..C, we define βd = d x {Delta} β, and run a Markov chain having pβd as its stationary distribution. The resulting sample is denoted by


Formula 28

(28)
From that, Eβd[U] is estimated as


Formula 29

(29)
and by Simpson's triangulation, one gets the discrete thermodynamic estimate of µ = ln Z1 – ln Z0:


Formula 30

(30)

We used this discrete method previously (Lartillot and Philippe, 2004). In the present work, we also introduce a continuous (or quasistatic) version, which has the advantage of yielding a greater accuracy. The quasistatic method consists in equilibrating a MCMC under β = 0, then smoothly increasing the value of β, by adding a constant increment {delta} β after each series of Q cycles, until β = 1 is reached (Fig. 1d). During this procedure, points {theta}k are saved, for instance, before each update of β. Let us denote (βk, {theta}k)k = 0..K the series of points obtained in this way. One has in particular β0 = 0, βK = 1, and {forall} k 0 ≤ k < K, βk+1– βk = {delta} β. Then, the quasistatic estimate of ln Z1 – ln Z0 is given by:


Formula 31

(31)
Equivalently, one can start at β = 1, equilibrate the MCMC, and then progressively decrease β, while sampling along the path down to p0. We will make use of this bidirectional method below.

Many different schemes of thermodynamic integration can be devised, depending on the type of path considered. In the present work, we have used two main schemes.

Annealing-Melting Integration
This first scheme involves only one model (M) at a time, the path going from the prior to the unnormalized posterior defined by M. This path is defined as follows:


Formula 32

(32)
Note that q0({theta}) = p({theta} | M), and q1({theta}) = p(D | {theta}, M) p({theta} | M). The corresponding normalization constants are Z0 = 1 and Z1 = p(D | M). The integrand takes a simple form:


Formula 33

(33)


Formula 34

(34)
so that the integration defined above directly leads to an estimate of


Formula 35

(35)


Formula 36

(36)

In a thermodynamic perspective, the inverse of β can be seen as the equivalent of a temperature. Raising the likelihood to a power β < 1 is equivalent to smoothing out the likelihood surface, which will yield a "looser" Markov chain, more prone to accepting less likely parameter configurations. This is analogous to the behavior of a thermodynamic system, which has a higher probability of visiting high-energy microscopic configurations at higher temperature. Thus, slowly moving from β = 0 to β = 1 is equivalent to a quasistatic cooling down, or annealing, of the MCMC. Conversely, moving from β = 1 to β = 0 amounts to a warming up, or melting, of the MCMC. Note that the heated chains of the parallel Metropolis-coupled Markov chains (Altekar et al., 2004) are defined in a similar way (except that in their case, the posterior, rather than only the likelihood, is raised to the power β).

As an illustrating example, the annealing-melting integration method was applied to the PGK dataset, under a simple version of the rate-across-site (RAS) model, assuming a Poisson+F amino-acid replacement matrix, a prior mean branch length of {lambda} = 10, and {gamma}-distributed rates across sites, with {alpha} = 1, and no invariant sites. The topology was constrained to the posterior consensus. Figure 2 shows the evolution of ln p(D | {theta},RAS) as a function of β. The increment was {delta} β = ± 0.001 and was applied every Q = 10,000 cycles. The area situated between the curve and the zero axis is equal to the logarithm of the marginal likelihood under RAS and can be estimated using Eq. (31). In the present case, one obtains ln p(D |RAS) = –9922.1 natural log units (nits).


Figure 2
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 Annealing integration under the RAS model for the PGK dataset. The 1,000 values of ln p(D | {theta}, M) sampled during the quasistatic integration are plotted against β, shadowing the curve of Eβ [ln p(D | {theta}, M)] as a function of β. The logarithm of the marginal likelihood, as the integral of Eβ[ln p(D | {theta}, M)] over [0,1], can then be estimated as the integral of the curve.

 
Note that the curve shown in Figure 2 is strictly increasing. This can be shown theoretically: above, we have seen that the first derivative of ln Z was related to the expectation of U (Equation 24). By a similar argument, the second derivative of ln Z can be related to the variance of U:


Formula 37

(37)
where Vβ[U] = Eβ[U2] – Eβ2[U]. In the present case, U = ln p(D | {theta}) does not depend on β, so that this simplifies into


Formula 38

(38)
We will make further use of this identity when evaluating the error on the estimate.

The annealing method was applied on the same dataset, under the Poisson+F model and without rate variation across sites (UNI model), yielding an estimate of ln p(D | UNI) = –10,309.0 nits. The logarithm of Bayes' factor between the two models UNI and RAS is then obtained by taking the difference between the two estimated supports:


Formula 39

(39)
i.e., the PGK dataset gives a support of approximately 386.9 nits in favor of RAS over UNI.

Model-Switch Integration
As is clear in the previous example, the difference between the logarithm of the marginal likelihoods of the two models can be small compared to these two values. This can lead to poor estimates, unless the precision on each marginal likelihood is very high. For this reason, rather than performing two quasistatic moves from the prior to each of the two models' posterior distributions, it might be more convenient to make a single, and shorter, path directly connecting the two models in the space of unnormalized densities. This is the rationale of the model-switch method.

Suppose that we want to compare two models M0 and M1 that are defined on the same parameter space {Theta}. Note that this does not restrict the generality of the procedure, because parameters specific of, say, M0 can be included in the common parameter vector, but not be involved in the computation of the likelihood according to M1. The model-switch scheme involves a path that goes directly from model M0 to model M1:


Formula 40

(40)
For β = 0 or 1:


Formula 41

(41)


Formula 42

(42)


Formula 43

(43)


Formula 44

(44)
Therefore, in the present case, performing the thermodynamic integration leads to computing the logarithm of the Bayes' factor between the two models.

Differentiating pβ with respect to β yields


Formula 45

(45)


Formula 46

(46)

The Bayes' factor between the two models UNI and RAS was recomputed using this alternative integration scheme, using an increment {delta} β = 0.001, and every Q = 1,000. Figure 3 shows the collection of values of U({theta}) = ln p(D|{theta},RAS)–ln p(D|{theta},UNI) collected under a quasistatic transformation from UNI to RAS. In the present case, the two models have the same prior density, so that we only need to compute the difference between their log-likelihoods. In graphical terms, the logarithm of the Bayes factor is equal to the algebraic area situated between the curve and the x-axis, and as before, can be estimated by averaging over the sample (Eq. (31)). In the present case, it yields an estimate of 386.7 nits, very close to our previous estimate obtained with the annealing-melting method.


Figure 3
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 Model-switch integration between the uniform-rate (UNI) and the rate-across-site (RAS) models for the PGK dataset. The 1,000 values of ln p(D|{theta},RAS)–ln p(D|{theta},UNI) sampled during the model-switch integration are plotted against β. The logarithm of the Bayes' factor can be estimated as the algebraic area between the curve and the x-axis.

 
Note that, as in the case of the annealing-melting scheme, the function being integrated is monotonous (its derivative being equal to Vβ [ln p(D|{theta},RAS)–ln p(D|{theta},UNI)]).

Error Estimation
We will consider the discrete and the quasistatic procedures separately. For the discrete estimator, two main sources of errors have to be considered: the sampling error and the error induced by the discretization. First, the sampling variance is equal to


Formula 47

(47)
where


Formula 48

(48)
and Keff is the effective sample size. The corresponding standard error is {sigma}s = Formula . As for the discretization error, because Eβ[U] is a monotonous function of β, the worst-case upper (resp. lower) error is given by the area between the piecewise continuous function joining the measured values of Eβ[U] and the upper (resp. lower) step function built from them. Both areas are equal to:


Formula 49

(49)
The total error can then be estimated as the worst-case discretization error, combined with the 95% confidence interval of the sampling error: {sigma} = {sigma}d + 1.645 {sigma}s.

Concerning the quasistatic estimator, the discretization error is again equal to


Formula 50

(50)
Assuming independence between the successive points of the chain, the sampling variance is given by


Formula 51

(51)


Formula 52

(52)


Formula 53

(53)
Using the second moment identity (Eq. (38)), the second term of Eq. (53) can be reformulated as


Formula 54

(54)


Formula 55

(55)


Formula 56

(56)
so that the sampling variance of the quasistatic estimate is


Formula 57

(57)
and the associated standard error is equal to {sigma}s = Formula . As in the case of the discretized estimate, the total error is {sigma} = {sigma}d + 1.645 {sigma}s.

All this is valid only if the points are truly independent draws from their respective pβ distributions. If this is not the case, then a factor {tau} = K/Keff (i.e., the decorrelation time) has to be put in front of the right-hand side of Eq. (57), to account for the effective sample size. Here, the situation is slightly more complicated due to the fact that, as β moves from 0 to 1, the decorrelation time of the chain might change. In general, we did not observe large variations of the decorrelation time for different valus of β. In practice, we compute the decorrelation time at β = 0 ({tau}0) and β = 1 ({tau}1) and take the larger of the two.

In addition, because β changes continuously during sampling, the chain is never exactly at equilibrium, and this will cause a "thermic lag" of the MCMC: when sampling a value of {theta} at the current value of β, one is in effect sampling from pβ', with β' slightly smaller than β. Because U({theta}) is an increasing function of β, one expects this lag to result on average in an underestimation of µ when going from 0 to 1. In contrast, performing a quasistatic move from 1 to 0 will lead to an overestimation of µ. These under- and overestimations obtained by performing a bidirectional estimation are interesting, because they allow us to bracket the true value. Specifically, each direction yields a confidence interval of the form [µ – {sigma}, µ + {sigma}]. In the present article, we will always display these two intervals together, but they could as well be merged into a definitive confidence interval (i.e., the smallest interval of R containing them). This will account for worst case errors due to thermic lag and discretization, as well as the 95% level confidence related to the sampling error. In principle, there is less than 5% of chances that the true value lies outside. In practice, however, when the discretization error or the thermic lag dominate the sampling error, the true risk is much lower.

In summary, we propose the following overall procedure, allowing to estimate all sources of errors for the quasistatic method:

  1. Equilibrate MCMC at β = 0, and obtain a first sample of K1 points, on which to estimate E0[U], V0[U] and {tau}0;
  2. Perform the quasistatic sampling, as explained above, moving β progressively from 0 to 1;
  3. Once β = 1, perform an additional series of K1 steps, to evaluate E1[U], V1[U] and {tau}1.
An estimate of µ is obtained at step 2 (which we call Formula to mean that it potentially underestimates µ because of the thermic lag). The corresponding discretization and sampling errors can be computed from steps 1 and 3, and combined into {sigma}. Doing the same sampling procedure from 1 to 0 yields another estimate Formula +, with an error of {sigma}+. Finally, the two estimates and their respective errors can be combined together, as explained above.

To illustrate the interplay between the different sources of errors, we applied both the discrete and the quasistatic estimators to the evaluation of the Bayes' factor between the two models UNI and RAS. First, the discrete method was applied, using C = 10 intervals across [0, 1]. This implies running 11 chains, each of which was run for a total of 110,000 cycles, including a burn-in of 10,000 cycles, and saving 1 point every 100 cycles. The estimated decorrelation time of the chains varied between 1 and 2.6 saved points (or equivalently, between 100 and 260 cycles). The logarithm of the Bayes' factor was estimated at 379.3 nits, with a total error of 44.2 nits. Not surprisingly, the discretization error is dominant in this case ({sigma}d = 43.0), whereas the sampling error is small ({sigma}s = 0.74).

Next, we applied the bidirectional model-switch quasistatic method, under several values of Q and {delta} β (Table 1). In each case, the two separate confidence intervals are indicated, together with the estimated discretization and sampling errors, and the decorrelation times. The discretization error is much smaller than with the discrete method: it is of the same order of magnitude as the sampling error for {delta} β = 0.01, and negligible (less than a natural unit of logarithm) under {delta} β = 0.001. The sampling error also decreases with {delta} β, as expected. As for the thermic lag, it manifests itself by the fact that the two intervals are shifted with respect to each other (except when Q > 1,000, but then, the observed shift is within sampling error). In all cases, the two intervals are overlapping, except when {delta} β = 0.001 and Q = 10. Note, however, that in the latter case, the combined interval encompasses all other intervals obtained under more stringent settings. The quasistatic method can thus work under two regimes: either the thermic lag is negligible, in which case the two estimates obtained by the bidirectional method are congruent within sampling error, or it is dominant, and then, what one obtains is modulo sampling error, a bracketing of the true value.


View this table:
[in this window]
[in a new window]

 
Table 1 Bayes' factor between the RAS and the UNI models for the PGK dataset: precision of the model-switch estimate as a function of {delta} β and Q. For each condition, a bidirectional model-switch integration is performed and the total confidence interval is evaluated as indicated in the text. The discretization ({sigma}d) and sampling ({sigma}s) errors are reported, as well as the estimated decorrelation time of the chain (in each case, the largest value among the two directions is indicated).

 

    Comparing Importance Sampling and Thermodynamic Integration
 Top
 Abstract
 Data and Models
 Marginal Likelihood Estimation
 Comparing Importance Sampling...
 Application: Models of Amino...
 Discussion
 References
 
Technically, the estimation of the marginal likelihood of a model amounts to the numerical evaluation of an integral. Therefore, a simple way of validating an estimation method consists in applying it to cases where the integral can be computed in a closed form. This estimate can then be compared with the analytical value.

A Gaussian Model
We first considered a simple model, parameterized by a vector x = (x1,x2,c,xd) of dimension d. The prior on x is a product of independent normals on each xi, i = 1... d, of mean 0 and variance 1. The likelihood is


Formula 58

(58)
where v is a hyperparameter. The posterior is then also a product of independent normal distributions, with mean 0 and variance v/(1 + v), and the log of the Bayes' factor is d [ln (v) – ln (1 + v)]/2. The prior, the posterior, as well as the posterior's β-heated form, are all Gaussian, and sampling independent values of x from them is straightforward. The importance sampling estimators (HME, SHME, and AME) and the annealing-melting thermodynamic integration methods can therefore all be applied directly.

As shown in Table 2, for v = 1 and d = 1, all methods perform reasonably well, with a relative error less than 0.1% for samples of 106 points. However, still in the univariate case (d = 1), when the variance of the likelihood is small compared to that of the prior (v = 0.01), the HME is not reliable, even for large samples. More precisely, it systematically overestimates the marginal likelihood. The stabilized version does better but, in fact, even the primitive AME performs a correct estimation in this case. The thermodynamic method also yields a reliable estimate. Finally, under both high dimension (d = 100) and small variance (v = 0.01), all three importance sampling methods fail, whereas thermodynamic integration remains well-behaved. Note that the HME and SHME lead to a systematic overestimation, and the AME to an underestimation.


View this table:
[in this window]
[in a new window]

 
Table 2 Logarithm of the marginal likelihood of a Gaussian model, evaluated by the harmonic mean estimator (HME), its stabilized version (SHME, using {delta} = 0.1), the prior arithmetic mean estimator (AME), and the annealing thermointegration procedure. For each setting, the mean and standard error of 10 independent estimations are displayed.

 
Evaluating the Averaged Likelihood of a Tree
In order to evaluate the reliability of these alternative methods in a phylogenetic context, one would need to find a model for which analytical integration is possible, which is in general not the case. However, there is a very common situation, where an integral (in fact, a sum), is performed analytically: the classical likelihood, evaluated at a given site, is a sum over the 20P–3 possible configurations specifying the amino-acid state at each internal node of the tree (for that reason, it is sometimes called the averaged likelihood). Denoting such a configuration by s:


Formula 59

(59)
where Ci stands for the ith column of the alignment. Usually, this summation is done by dynamic programming, using the well-known "pruning" algorithm (Felsenstein, 1981), but we can also perform this summation using the harmonic or thermodynamic methods, and compare the resulting estimates with the true value, obtained by pruning.

To compute the HME, we have to be able to sample values of s, according to p(s| {theta}, M), which we can do using the algorithm proposed by Nielsen (Nielsen, 2001). As for the thermodynamic integration, a straightforward generalization of this algorithm (obtained by replacing all instances of p(s| {theta}, M) by p(s| {theta}, M)β in the computations) allows one to sample as well from the heated distribution p(s| {theta}, M)β for any β isin [0,1]. We applied this rationale to the evaluation of the integrated likelihood under the PGK dataset. Specifically, we computed the likelihood of the marginal posterior consensus tree, assuming a simple model with uniform rates across sites, and a Poisson process of amino-acid replacement, with uniform stationary probability vector (Table 3). Here again, whereas the thermodynamic integration method yields estimates close to the true value, the HME is not reliable, even when samples of 105 independent points are used.


View this table:
[in this window]
[in a new window]

 
Table 3 Logarithm of the integrated likelihood of the mean posterior consensus topology, for the PGK dataset. For each setting, the mean and standard error of 10 independent estimations are displayed.

 

    Application: Models of Amino-Acid Replacement
 Top
 Abstract
 Data and Models
 Marginal Likelihood Estimation
 Comparing Importance Sampling...
 Application: Models of Amino...
 Discussion
 References
 
Finally, we applied the HME and thermodynamic integration to the comparison of alternative models of amino-acid replacement (Tables 4, 5, 6). Concerning thermodynamic integration, we used both the discretized method, with C = 10, and the quasistatic model-switch schemes (with {delta} β = 0.01 and Q = 100). For the HME, we ran chains of 1,100,000 cycles, saving one point every 100 cycles. The effective sample sizes ranged from 150 to 2000.


View this table:
[in this window]
[in a new window]

 
Table 4 Logarithm of the marginal likelihood of alternative amino-acid replacement models evaluated on the PGK dataset, by model-switch (MS, the two confidence intervals are reported), discrete (DS) thermointegration, and harmonic mean estimation (HME). The three alternative sets' priors over the hyperparameters (P1, P2, P3, see Data and Models) were considered (P1 is the default prior). All evaluations were performed on a predefined topology, except one bidirectional quasistatic run, performed under free topology (FT). Poisson is taken as the reference model. Highest scores are indicated in bold face.

 


View this table:
[in this window]
[in a new window]

 
Table 5 Logarithm of the marginal likelihood of alternative amino-acid replacement models evaluated on the EF dataset, by model-switch (MS) discrete (DS) thermointegration and harmonic mean estimation (HME), under fixed (default) or free (FT) topology. Poisson is taken as the reference model. Highest scores are indicated in bold face.

 


View this table:
[in this window]
[in a new window]

 
Table 6 Logarithm of the marginal likelihood of alternative amino-acid replacement models evaluated on the POL, DLIG, and UVR datasets, by model-switch (MS) thermointegration and harmonic mean estimation (HME). Poisson is taken as the reference model. Highest scores are indicated in bold face.

 
Bayes' factors are known to be sensitive to the choice of the prior. In the present cases, we used by default exponential priors of mean 1 on the two hyperparameters tuning the mean of the prior on the branch lengths ({lambda}) and the variance of the rate distribution ({alpha}). However, to measure the impact of the choice of the prior, we also tried two alternative sets of priors on the PGK dataset (see Data and Models). This did not fundamentally change the results (Table 4), indicating that Bayes' factors are robust to the choice of the prior on these parameters.

According to the results obtained by model-switch thermodynamic integration, for all the investigated datasets, the empirical matrix WAG is much better than Poisson (Tables 4, 5, 6). In addition, considering the stationary probabilities as free parameters (WAG+F) yields a better fit than fixing them to their default values (WAG), a fact that was also observed by the authors of the WAG matrix, by a likelihood ratio test (Whelan and Goldman, 2001). The general reversible model, GTR is in general better than WAG+F, except for the smaller dataset, PGK. In the case of POL, the confidence intervals obtained for the two models, GTR and WAG+F, are overlapping, and it is thus not clear which model has a better fit. Finally, in all cases, MAX is worse than all models but Poisson. In contrast, the fit of CAT is dataset dependent, as it performs better than WAG on the DLIG and EF alignments, but not on POL, nor on PGK. In the case of UVR, there is again a slight overlap between CAT and GTR.

All these Bayes' factors were computed under a fixed topology, constrained according to external criteria (see Data and Models). However, as shown for the PGK and the EF datasets, the ordering of the models is totally identical when averaging over topologies (Tables 4, 5).

In general, the discrete version of thermodynamic integration yields estimates consistent with those computed using the model-switch method (Tables 4, 5), but with a much greater uncertainty. In most cases, the confidence intervals obtained for the alternative models are widely overlapping, which makes it difficult to decide which model is best. In contrast, the estimates obtained by the HME are incongruent with those computed using thermodynamic integration (Tables 4 to 6), except when comparing models having a similar number of parameters (such as WAG+F and Poisson+F). When comparing models of widely differing dimensionality, however, the estimated Bayes' factors are so different that the two methods even differ in their conclusions. For instance, the HME always gives CAT as the best model, whereas thermodynamic integration sometimes favors WAG (PGK) or GTR (POL).


    Discussion
 Top
 Abstract
 Data and Models
 Marginal Likelihood Estimation
 Comparing Importance Sampling...
 Application: Models of Amino...
 Discussion
 References
 
Reliability of Marginal Likelihood Estimators
In this work, we have applied two main alternative methods of Bayes' factor evaluation: the harmonic mean estimator (HME) and thermodynamic integration. Our comparative analysis shows a striking discrepancy between them, and comparisons with true values that can be analytically computed, in the case of the normal model (Table 2) or in the context of pruning (Table 3), indicate that this is due to a lack of reliability of the HME. At the same time, these comparisons provide a validation of our implementation of the method of thermodynamic integration.

To see why the HME is misleading, we can rely on the following intuitive reasoning. Supposing, for simplicity, that the likelihood is unimodal, the marginal likelihood is more or less the product of two factors: the likelihood reached in the high-likelihood region (the mode height) and the relative size of this region (the mode width). This latter factor, the mode width, is more precisely defined as the ratio of the size of the region under the posterior mode to the overall size of the parameter space. It acts as an Ockham factor (Jaynes, 2003), as it will be smaller for more ad hoc models, which reach a significant likelihood only under very specific values of the parameters. Note also that, in general, it will tend to be smaller for higher dimensional models.

For an estimator such as the HME to work, it has to retrieve reliable information about both the mode height and the mode width from a posterior sample. Concerning the mode height, the value of the likelihood reached at equilibrium is a good indication. As for the mode width, the only way to extract information about it is by measuring the relative frequency at which points of the sample fall inside and outside the mode. However, obtaining reliable estimates of this frequency requires that a sufficient number of points outside the mode be included in our sample. Yet, in practice, the contrast between the low and the high likelihoods is in general so great that even a posterior sample of astronomical size will be virtually confined within the mode. The estimated frequency at which the low-likelihood region is visited is then 0, which means that, in effect, the HME behaves as if the mode was occupying the entire parameter space (Ockham factor = 1), and therefore, completely underestimates the dimensional penalty.

As a result, the HME overestimates p(D | M). Furthermore, this overestimation will be more pronounced in the case of higher dimensional models, for which the Ockham factor is smaller, which implies that the harmonic estimator will be effectively biased in favor of such models. This is, in fact, exactly what we see when comparing models of amino-acid replacement (Tables 4, 5, 6): whereas the HME yields a more or less correct value of the Bayes' factor between models of equivalent dimensions (i.e., Poisson versus WAG), it completely reverses the conclusions when comparing models of widely differing dimensionality, such as WAG versus CAT. This is particularly striking in the case of MAX, the most parameter-rich model, for which the error is more than fourfold on the logarithm scale.

The fact that the HME systematically overestimates the marginal likelihood may well explain a few odd results obtained recently. First, it was observed that Bayes' factors tend to support higher dimensional models in a too systematic way, to the point that it was concluded that Bayes' factors may not "strike a reasonable balance between model complexity and error in parameter estimates" (Nylander et al., 2004). Second, and more disturbingly, in simulation studies, Bayes' factors seemed to favor models more complex than the actual model used to simulate the data (Pagel and Meade, 2004). Given what we have shown above, these outcomes could also be due, not to a fundamental lack of reliability of the Bayes' factor, but instead to the systematic distortion of the HME in favor of more complex models.

Our analysis stresses the importance of using more robust and well-validated methods for Bayes' factor evaluations. Neither the HME nor its stabilized version fall into this category. We have also tested other estimators based on the importance sampling principle (Geyer, 1994; Meng and Wong, 1996), in particular the estimator proposed by Meng and Wong, which can be shown to be optimal in its category (that includes the HME and the AME). Yet none of them gave reliable results (not shown). More generally, our experience is that importance sampling estimators do not work well on large datasets.

As a general alternative, we propose to employ thermodynamic integration. This method is certainly not straightforward. It is theoretically quite involved, requires additional code-writing for sampling along paths in the space of distributions, and, furthermore, is computationally intensive. According to our experience, as a rule of thumb, thermodynamic integration will require about 10 times more CPU time than a plain posterior sampling under the more demanding among the two models being compared. In general, this means running a chain for several days, up to several weeks for models like CAT, for which mixing is more challenging. On the other hand, it seems to be more reliable. In the present work, it has led to correct estimates in the two cases in which we can compute the corresponding integral in a closed form. In addition, it has better theoretical properties (Gelman, 1998). In particular, its variance is well within control: as can be seen from our error estimation, the variance is at most quadratic in the logarithm of the likelihood of the dataset, which is itself linear in the size of the alignment. Hence, using more complicated methods, such as thermodynamic integration, seems to be the price to pay for correctly evaluating high-dimensional integrals such as the marginal likelihood. Thus far, no other method of equivalent precision and generality is yet available, although some ideas have been proposed (Chib, 1995; Chib and Jeliazkov, 2001), which we are currently exploring.

Our comparison of alternative models of amino-acid replacement confirms and extends what we have presented previously (Lartillot and Philippe, 2004), except in one case: we previously reported that GTR was less fit than WAG+F on the EF dataset, whereas we now find that GTR is in fact better than WAG+F (Table 5). As we can check by the error analysis developed in the present article, this is due to the lack of precision of the discrete version. The quasistatic method thus appears to be much more precise than the discrete version, all the more so as the error can be controlled with great flexibility (Table 1).

The disrete and quasistatic schemes that we have introduced here are not the only possible approaches to thermointegration, however. For instance, an alternative method consists in simulating the joint distribution on (β,{theta}) (Gelman, 1998). This has the advantage of eliminating both the thermic lag and the discretization error. In the applications presented in this article, the thermic lag and the discretization error are not too problematic, thanks to the monotony of the integrand. However, there are many other potentially interesting paths, not all of which have this monotony property. On the other hand, simulating from the joint distribution on (β, {theta}) also entails some practical difficulties. In any case, the two approaches need to be compared in practice.

Otherwise, two major conclusions can be drawn from these comparisons. First, in the cases investigated here, the ordering of the models is the same, whether the topology is fixed to that obtained under the standard model (WAG+F, I+{Gamma}), or whether it is averaged away. This confirms that model comparisons seem to be robust to the choice of the topology, as long as this topology is reasonable (Posada and Crandall, 2001). Nevertheless, we do not think that this should be considered as a generalizable rule. In particular, this might not hold anymore if uncertainty is high in the tree, or if each model strongly supports a distinct phylogeny. In such situations, it may be more reasonable to average over topologies, at least if CPU requirements are not limiting.

Second, accounting for pattern heterogeneity across sites by a mixture model results in a better fit in the majority of the cases, although, importantly, some datasets give a greater support for simpler models, like GTR or WAG+F for POL, or WAG for PGK. This could mean that the Dirichlet process requires alignments larger than those investigated here to correctly learn its parameters. Alternatively, it could be due to our approximation consisting in considering only mixtures of Poisson processes, instead of more general mixtures. A different mixture model was proposed recently in which, in contrast to CAT, the stationary probabilities are set equal across the mixture, whereas the relative exchangeability parameters of the matrix are category specific (Pagel and Meade, 2004). Obviously, a combination of the two, i.e., a general mixture of GTR matrices, should be tested.

More generally, many other models of protein evolution can be imagined, allowing for diverse kinds of heterogeneity across sites, or across lineages, which raises a problem of how to choose among all these possibilities. In this article, we have proposed a general method for this purpose. This method can be used as a guide, allowing one to progressively focus on better models of molecular evolution.


    Acknowledgements
 
We wish to thank Henner Brinkmann for making available the aligned sequences on which this work was based. We are also grateful to Nicolas Rodrigue, David Bryant, Olivier Gascuel, Thomas Lepage, and the two referees for their useful comments on the work and on the manuscript. This work was funded by the Centre National de la Recherche Scientifique, the French funding program ACI IMP-BIO "Model Phylo," and the 60th "Comission Mixte Permanente Franco-Québécoise."


    References
 Top
 Abstract
 Data and Models
 Marginal Likelihood Estimation
 Comparing Importance Sampling...
 Application: Models of Amino...
 Discussion
 References
 

    Aitkin M. Posterior Bayes factors. J. R. Stat. Soc. B (1991) 53:111–142.

    Altekar G., Dwarkadas S., Huelsenbeck J., Ronquist F. Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics (2004) 20:407–415.[Abstract/Free Full Text]

    Berger J., Pericchi L. The intrinsic Bayes factor for model selection and prediction. J. Am. Stat. Assoc. (1996) 91:109–122.[CrossRef][Web of Science]

    Bollback J. P. Bayesian model adequacy and choice in phylogenetics. Mol. Biol. Evol. (2002) 19:1171–1180.[Abstract/Free Full Text]

    Brinkmann H., vander Giezen M., Zhou Y., Poncelin de Raucourt G., Philippe H. An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst. Biol. (2005) 54:743–757.[Abstract/Free Full Text]

    Castresana J. Selection of conserved blocks from multiple alignment for their use in phylogenetic analysis. Mol. Biol. Evol. (2000) 17:540–552.[Abstract/Free Full Text]

    Chib S. Marginal likelihood from the Gibbs output. J. Am. Stat. Assoc. (1995) 90:1313–1321.[CrossRef][Web of Science]

    Chib S., Jeliazkov I. Marginal likelihood from the Metropolis-Hastings output. J. Am. Stat. Assoc. (2001) 96:270–281.[CrossRef][Web of Science]

    Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Gaut B. S., Lewis P. O. Success of the maximum likelihood phylogeny inference in the four taxon case. Mol. Biol. Evol. (1995) 12:152–162.[Abstract]

    Gelman A. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Stat. Sci. (1998) 13:163–185.[CrossRef][Web of Science]

    Gelman A., Carlin J. B., Stern H. S., Rubin D. B. Bayesian data analysis (2004) London: Chapman and Hall/CRC.

    Gelman A., Meng X. L., Stern H. Posterior predicive assessment of model fitness via realised discrepancies. Stat. Sinica (1996) 6:733–807.

    Geyer C. J. Practical Markov chain Monte Carlo. Stat. Sci. (1992) 7:473–483.[CrossRef]

    Geyer C. J. Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo (1994) University of Minnesota. Technical report 568, school of statistics.

    Green P. J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika (1995) 82:711–732.[Abstract/Free Full Text]

    Han C., Carlin B. P. MCMC methods for computing Bayes factors: A comparative review. Biometrika (2000) 82:711–732.

    Holder M., Lewis P. O. Phylogenetic estimation: Traditional and Bayesian approaches. Nat. Rev. Genet. (2003) 4:275–284.[CrossRef][Web of Science][Medline]

    Huelsenbeck J. P., Larget B., Alfaro M. E. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol. Biol. Evol. (2004) 21:1123–1133.[Abstract/Free Full Text]

    Huelsenbeck J. P., Larget B., Miller R. E., Ronquist F. Potential applications and pitfalls of Bayesian inference of phylogeny. Syst. Biol. (2002) 51:673–688.[Abstract/Free Full Text]

    Huelsenbeck J. P., Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics (2001) 17:754–755.[Abstract/Free Full Text]

    Irestedt M., Fjeldsa J., Nylander J. A., Ericson P. G. Phylogenetic relationships of typical antbirds (Thamnophilidae) and test of incongruence based on Bayes factors. BMC Evol. Biol. (2004) 4:23.[CrossRef][Medline]

    Jaynes E. Probability theory. In: The logic of science (2003) Cambridge, UK: Cambridge University Press.

    Jeffreys H. Some tests of significance, treated by the theory of probability. Proc. Camb. Phil. Soc. (1935) 31:203–222.[CrossRef]

    Jones D. T., Taylor W. R., Thornton J. M. The rapid generation of mutation data matrices from protein sequences. CABIOS (1992) 8:275–282.[Medline]

    Kass R., Raftery A. Bayes factors and model uncertainty. J. Am. Stat. Assoc. (1995) 90:773–795.[CrossRef][Web of Science]

    Larget B., Simon D. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. (1999) 16:750–759.[Web of Science]

    Lartillot N., Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. (2004) 21:1095–1109.[Abstract/Free Full Text]

    Meng X. L. Posterior predictive p-values. Ann. Stat. (1994) 22:1142–1160.[CrossRef][Web of Science]

    Meng X. L., Wong W. H. Simulating ratios of normalising constants via a simple identity: A theoretical exploration. Stat. Sinica (1996) 6:831–860.

    Minin V., Abdo Z., Joyce P., Sullivan J. Performance-based selection of likelihood models for phylogeny estimation. Syst. Biol. (2003) 52:674–683.[Abstract/Free Full Text]

    Neal R. M. Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. (2000) 9:249–265.[CrossRef][Web of Science]

    Newton M. A., Raftery A. E. Approximating Bayesian inference with the weigthed likelihood bootstrap. J. R. Stat. Soc. B (1994) 56:3–48.

    Nielsen R. Mapping mutations on phylogenies. Syst. Biol. (2001) 51:729–739.[CrossRef][Web of Science]

    Nylander J. A. A., Ronquist F., Huelsenbeck J. P., Nieves-Aldrey J. L. Bayesian phylogenetic analysis of combined data. Syst. Biol. (2004) 53:47–67.[Abstract/Free Full Text]

    Ogata Y. A Monte Carlo method for high dimensional integration. Num. Math. (1989) 55:137–157.[CrossRef]

    O'Hagan A. Fractional Bayes factors for model comparison. J. R. Stat. Soc. B (1995) 57:99–138.

    Pagel M., Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. (2004) 53:561–581.

    Philippe H. MUST, a computer package of management utilities for sequences and trees. Nucleic Acid Res. (1993) 21:5264–5272.[Abstract/Free Full Text]

    Philippe H., Lartillot N., Brinkmann H. Multigene analyses of bilaterian animals corroborate the monophyly of Ecysozoa, Lophotrochozoa and Protostomia. Mol. Biol. Evol. (2005) 22:1246–1253.[Abstract/Free Full Text]

    Posada D., Crandall K. Selecting the best-fit model of nucleotide substitution. Syst. Biol. (2001) 50:580–601.[Abstract/Free Full Text]

    Raftery A. E., Lewis S. M. [Practical Markov chain Monte Carlo]: Comment: One long run with diagnostics: Implementation strategies for Markov chain Monte Carlo. Stat. Sci. (1992) 7:493–497.[CrossRef]

    Rannala B. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst. Biol. (2002) 51:754–760.[Abstract/Free Full Text]

    Rubin D. B. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Stat. (1984) 4:1151–1172.

    Schwartz G. Estimating the dimension of a model. Ann. Stat. (1978) 6:461–464.[CrossRef][Web of Science]

    Stefanovic S., Rice D., Palmer J. Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol. Biol. (2004) 4:35.[CrossRef][Medline]

    Stone M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. B (1974) 36:111–147.

    Suchard M., Kitchen C. M. R., Sinsheimer J., Weiss R. E. Hierarchical phylogenetic models for analyzing multipartite sequence data. Syst. Biol. (2003) 52:649–664.[Abstract/Free Full Text]

    Suchard M., Weiss R., Sinsheimer J. Bayesian selection of continuous-time Markov chain evolutionary models. Mol. Biol. Evol. (2001) 18:1001–1013.[Abstract/Free Full Text]

    Sullivan J., Swofford D. L. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mammal. Evol. (1997) 4:77–86.[CrossRef]

    Thompson J., Higgins D., Gibson T. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994) 22:4673–4680.[Abstract/Free Full Text]

    Verdinelli I., Wasserman L. Computing Bayes factors using a generalization of the Savage-Dickey density ratio. J. Am. Stat. Assoc. (1995) 90:614–618.[CrossRef][Web of Science]

    Waddell P. J., Kishino H., Ota R. Very fast algorithms for evaluating the stability of ML and Bayesian phylogenetic trees from sequence data. Genome Inform. (2002) 13:82–92.[Medline]

    Whelan S., Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. (2001) 18:691–699.[Abstract/Free Full Text]

    Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.[Abstract]

    Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. (1994) 39:306–314.[CrossRef][Web of Science][Medline]

    Yang Z. Among site variation and its impact on phylogenetic analyses. Trends Ecol. Evol. (1996) 11:367–370.[CrossRef]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
N. Rodrigue, C. L. Kleinman, H. Philippe, and N. Lartillot
Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons
Mol. Biol. Evol., July 1, 2009; 26(7): 1663 - 1676.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
M. Anisimova and C. Kosiol
Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models
Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
G. Baele, Y. Van de Peer, and S. Vansteelandt
A Model-Based Approach to Study Nearest-Neighbor Influences Reveals Complex Substitution Patterns in Non-coding Sequences
Syst Biol, October 1, 2008; 57(5): 675 - 692.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
J. A. Clarke and K. M. Middleton
Mosaicism, Modules, and the Evolution of Birds: Results from a Bayesian Approach to the Study of Morphological Evolution Using Discrete Character Data
Syst Biol, April 1, 2008; 57(2): 185 - 201.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
J. P. Huelsenbeck and M. A. Suchard
A Nonparametric Method for Accommodating and Testing Across-Site Rate Variation
Syst Biol, December 1, 2007; 56(6): 975 - 987.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
N. Rodrigue, H. Philippe, and N. Lartillot
Exploring Fast Computational Strategies for Probabilistic Phylogenetic Analysis
Syst Biol, October 1, 2007; 56(5): 711 - 726.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
J. A. McGuire, C. C. Witt, D. L. Altshuler, and J. V. Remsen
Phylogenetic Systematics and Biogeography of Hummingbirds: Bayesian and Maximum Likelihood Analyses of Partitioned Data and Selection of an Appropriate Partitioning Strategy
Syst Biol, October 1, 2007; 56(5): 837 - 856.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
J. M. Brown and A. R. Lemmon
The Importance of Data Partitioning and the Utility of Bayes Factors in Bayesian Phylogenetics
Syst Biol, August 1, 2007; 56(4): 643 - 655.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (31)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Lartillot, N.
Right arrow Articles by Philippe, H.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Lartillot, N.
Right arrow Articles by Philippe, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?