Skip Navigation

Systematic Biology 2007 56(5):711-726; doi:10.1080/10635150701611258
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rodrigue, N.
Right arrow Articles by Lartillot, N.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Rodrigue, N.
Right arrow Articles by Lartillot, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 Society of Systematic Biologists

Exploring Fast Computational Strategies for Probabilistic Phylogenetic Analysis

Edited by Paul Lewis

Nicolas Rodrigue1, Hervé Philippe1 and Nicolas Lartillot2

1 Canadian Institute for Advanced Research, Département de Biochimie, Université de Montréal C.P. 6821, Succ. Centre-ville, Montréal, Québec, H3C 3J7, Canada E-mail: nicolas.rodrigue{at}umontreal.ca
2 Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, URM 5506, CNRS-Université de Montpellier 2 Montpellier, France


    Abstract
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
In recent years, the advent of Markov chain Monte Carlo (MCMC) techniques, coupled with modern computational capabilities, has enabled the study of evolutionary models without a closed form solution of the likelihood function. However, current Bayesian MCMC applications can incur significant computational costs, as they are based on a full sampling from the posterior probability distribution of the parameters of interest. Here, we draw attention as to how MCMC techniques can be embedded within normal approximation strategies for more economical statistical computation. The overall procedure is based on an estimate of the first and second moments of the likelihood function, as well as a maximum likelihood estimate. Through examples, we review several MCMC-based methods used in the statistical literature for such estimation, applying the approaches to constructing posterior distributions under non-analytical evolutionary models relaxing the assumptions of rate homogeneity, and of independence between sites. Finally, we use the procedures for conducting Bayesian model selection, based on Laplace approximations of Bayes factors, which we find to be accurate and computationally advantageous. Altogether, the methods we expound here, as well as other related approaches from the statistical literature, should prove useful when investigating increasingly complex descriptions of molecular evolution, alleviating some of the difficulties associated with nonanalytical models.

Keywords: Bayes factors; data augmentation; expectation maximization; gradient optimization; Laplace approximation; parameter expansion; thermodynamic integration

Received October 12, 2006; Revised January 29, 2007; Accepted May 30, 2007


In the 1980s, phylogenetic methods based on explicit mathematical models of molecular evolution were increasingly recognized for their statistical worth (Felsenstein, 1988), but were nonetheless considered computationally too prohibitive for inference based on large datasets. From a post-genomic standpoint, the phrase "large data sets" has acquired a whole new meaning, unfathomable to earlier practitioners. Modern computing machines make it possible to conduct probabilistic phylogenetic inference based on alignments of tens of thousands of residues. Nevertheless, even under such conditions, phylogenetic reconstruction artifacts are still observed in some cases (Philippe, 2005), implying that the underlying models of sequence evolution applied are too grossly misspecified; richer models, which more reasonably acknowledge molecular evolutionary patterns, are hence expected to exhibit greater robustness in wake of difficult phylogenetic questions (e.g., Ren et al., 2005; Baurain et al., 2007; Lartillot et al., 2007). In complement, better models can prove useful in elucidating particular aspects of molecular evolution (Pal et ai. 2006). From this perspective, phylogenetic analysis and evolutionary modeling have formed a complementary framework, driving an iterative research cycle grounded in model selection theory (Steel, 2005, Yang 2006).

Some of the significant and recurring hurdles to investigating more sophisticated models of molecular evolution are the associated implementation and computational difficulties. Indeed, the development of an algorithmic toolbox has been an intrinsic part of phylogenetic model building, now well covered in a recent string of books (Semple and 2003; Felsenstein, 2004; Gascue, 2005; Nielsen. 2005; Yang, 2006). In the last decade, Markov chain Monte Carlo (MCMC) techniques, typically invoked within a Bayesian framework, have permeated across several disciplines as general and unifying approaches to addressing many practical difficulties.

Under a model M, specified according to some high-dimensional hypothesis vector {theta} isin {Theta}, a Bayesian analysis conditions on a data set D, computing the posterior distribution p({theta} | D, M)from the prior distribution p({theta} | M) and the likelihood function p(D | {theta}, M) according to Bayes' theorem:


Formula 1

(1)
where


Formula 2

(2)
is a normalizing constant, also called the marginal likelihood. The most general MCMC scheme to approximating nonanalytical posterior expectations is to collect a large sample from (1) using the Metropolis-Hastings (MH) algorithm (Metropolis, 1953; Hastings, 1970). Given a reference hypothesis vector {theta}, the algorithm generates a new hypothesis vector {theta}' from a proposal density q({theta}', {theta})—which is generally easy to simulate from—and resets the reference vector to {theta} = {theta}'with probability {vartheta}:


Formula 3

(3)

Under certain conditions, repeatedly cycling over these steps forms a Markov chain with (1) as its stationary distribution (see, e.g., Robert, 2004). The first portion of the chain is typically discarded—the so-called burn-in period—and hypothesis vectors are drawn at regular intervals as the algorithm proceeds.

Two key features of the MH algorithm need be noted. The first is that the algorithm can be expanded to incorporate sampling over latent, or unobserved, states (van Dyk and Meng, 2001; Gelman, 2004). This property has allowed for the study of several new nonanalytical evolutionary models (Robinson et al. 2003; Lartillot and Philippe, 2004; Mateiu and Rannala, 2006; Huelsenbeck et al., 2006; Yu and Thorne, 2006a). The resulting sampling devices, on the other hand, can be elaborate, and most MCMC-based model development projects now involve rounds of comparisons and checks with other implementations, as well as computationally costly empirical assessments of convergence, burn-in time, decorrelation time, etc. In some cases, the MH kernel in (3) is itself approximated using CPU-intensive MCMC sampling methods (Robinson et al., 2003;, Rodrigueet al., 2005; Rodrigue et al., 2006; Yu and Thorne, 2006a), which usually require additional pilot runs and simulation checks.

The second important feature of the MH algorithm is that the marginal likelihood cancels out. Although this may enable a sampling from the posterior distribution, it does not allow for a direct estimate of a natural measure of model fit. In the Bayesian framework, two given models (M0, M1) can be compared based on the ratio of their respective marginal likelihoods, otherwise known as the Bayes factor (B01) (Jeffreys, 1935; Kass and Raftery, 1995):


Formula 4

(4)
The case B01 > 1 (B01 < 1) is considered as evidence in favor of M1 (M0). This provides an intuitively simple means of ranking models, with clear theoretical justifications contained within the laws of probability (Jaynes, 2003). On a practical level, the only method producing reliable estimates of Bayes factors across certain varieties of phylogenetic models is the thermodynamic integration approach, as described in Lartillot and Philippe (2006). Although this method has the advantage of general applicability, it can incur high computational costs; under certain model comparisons, a full thermodynamic integration may require weeks on a modern desktop computer, even under a fixed tree topology for small single protein data sets (Lartillot and Philippe, 2006; Rodrigue et al., 2006).

Thus, for some models of interest, the methods proposed to date bring us full circle to computational prohibitiveness, where relying on brute-force MCMC sampling from posterior distributions becomes overtaxing. Alternatives to brute-force sampling, however, are commonly applied in the statistical literature, with first approximations often based on assumptions of normality about a dominant posterior mode Gelman et al., 2004; Robert and Casella, 2004). These techniques have been adapted to cases of nonanalytical models, using Monte Carlo-based optimization schemes (e.g., Robert and Casella, 2004: chapter 5), and although they may fail when the posterior is not normally distributed, or if modes of significant density are missed or ignored, they may still serve as guides for constructing distributions (e.g., as posterior mode finders), and can provide rough estimates of Bayes factors (Gelman et al., 2004).

In this work, we explore the use of MCMC-based normal approximation strategies in a phylogenetic context, with the objective of enabling more tractable applications of evolutionary models that are too complex to be manipulated using conventional methods. Utilizing previously proposed MH operators, we first illustrate MCMC-based optimization algorithms, which can be used to estimate maximum likelihood (ML) parameter values, even under non-analytical models. In a second step, MCMC approaches are applied in conjunction with normal developments around the optimal point in order to approximate the posterior distribution. We further combine these different MCMC schemes and normal approximations into a Bayes factor estimator, based on a variant of the Laplace method (Raftery, 1996). The approaches are applied under a fixed tree topology, using three different types of models of amino acid sequence evolution, and the resulting approximations are compared with those obtained under existing brute-force MCMC methodologies.

Our findings indicate that normal approximation methods can produce faster explorations of posterior distributions than the standard MCMC devices. The approaches provide a particularly useful alternative under models where a full posterior sampling rests on estimates of the MH kernel itself. Also, the methods produce surprisingly accurate Bayes factor estimates, within considerably less CPU time than previously proposed techniques. Finally, we discuss other potential methodological variations, and suggest possible avenues to investigating the relations of model complexity, data set characteristics, and robust phylogenetic inference.


    Models
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
Canonical Models
The likelihood function of what we will consider as the canonical phylogenetic models is defined according to a continuous-time Markov process, with a 20 x 20 infinitesimal generator specifying the instantaneous rate of substitution from one amino acid to another. When all types of substitutions are assumed to occur at equivalent rates, we refer to the model as POISSON. More commonly, however, the infinitesimal generator is fixed according empirically pre-determined values, such as the WAG matrix proposed by Whelan and Goldman (2001).

Under these models, the likelihood can be calculated in closed form, employing matrix diagonalization routines and the pruning algorithm (Felsenstein, 1981). This will thus provide an example for exploring latent state sampling approaches in a case where conventional methods are available.

Rate Heterogeneity
An important extension to the canonical phylogenetic models is the rates-across-sites approach proposed by Yang (1993, 1994). Under this model (+{Gamma}), the overall substitution rates of the columns of an alignment are treated as random variables, drawn from a statistical law: the gamma distribution of mean 1 and variance {alpha}–1. The likelihood function then takes the form of an integral over the statistical law.

In practice, however, integrating over the statistical law is not analytical, and the commonly adopted approximation procedure discretizes the gamma distribution into a predefined number of classes (typically 4 or 8), reducing integration to a weighted sum (Yang, 1994). In the present work, however, we will retain the continuous distribution as an example where parameter expansion approaches can be applied.

Structural Constraints
One clear oversimplification of all models mentioned above is the assumption of independence between sites. Inspired from the work of Parisi and Echave (2001), and the sampling schemes of Jensen and Pedersen (2000) and Pedersen and Jensen (2001), Robinson et al. (2003) have proposed a MCMC methodology allowing for a general interdependence between sites. Here, we focus on variants of their approach, based on the work presented in Rodrigue et al. (2005); Rodrigue et al. (2006).

The main motivation of site-interdependent models has been to incorporate explicit protein structure considerations within the phylogenetic context. We will borrow the nomenclature of Parisi and Echave (2001), and refer to the models as structurally constrained (SC). The centerpiece of SC models is a set of knowledge-basedor statistical potentials (e.g., Miyazawa and Jernigan, 1985; Sippl, 1990; Jones et al., 1992), which, in this context, are used to evaluate the compatibility of an amino acid sequence with a given protein structure. Formulated in terms of a pseudo-energy score for a sequence s in conformation c, written here as G(s,c), we will use the potentials derived in Kleinman et al. (2006), which include an amino acid contact potential, a solvent accessibility potential, as well as an account of compositional effects inspired from the random energy approximation (Shakhnovich and Gutin, 1993; Sun et al., 1995; Seno et al., 1998).

The continuous-time Markov chain under this model is specified as a sequence-wide process: for a sequence of length N the infinitesimal generator is a 20N x 20N matrix R with entries


Formula 5

(5)
where β is a parameter weighting the impact of G(s,c) on the rate of substitution. Here, however, the potentials used have been scaled such that we may fix this parameter at β = 1/2, and we refer to the model simply as SC. In some cases, it may be worthwhile to treat β as a free parameter (SC+β), in order to give some flexibility to the model, or if the scaling of the potential is unclear.

The order of the rate matrix given in (5) does not allow for a closed form likelihood, and this model serves as an example where data augmentation systems provide useful alternatives. In addition, it may be pertinent to combine SC and SC+β models with a direct account of rate heterogeneity (+{Gamma}), in which case we merge parameter expansion and data augmentation schemes, as we describe below.

Priors
Unless specified otherwise, we use the following default priors for {lambda} (the set of branch lengths), {alpha} (the shape parameter of the gamma law), and β (the pseudo-energy weighting):

  • {lambda} ~ Exponential, with a mean determined by a hyperparameter fixed at 0.1;
  • {alpha} ~ Exponential, with mean 1;
  • β ~ Uniform[–βmax, βmax], where βmax = 5.
We write {theta} in referring to all free parameters (i.e., {theta} = {{lambda}, {alpha}, β }, which is reduced in accordance when either of these parameters are fixed), such that M represents all fixed elements (e.g., the WAG matrix), as well as the overall construction of the model.


    Latent State Methods
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
In this section, we discuss two types of latent state (demarginalization) techniques that can be used for approximating posterior distributions, and briefly present how the methods can be applied to process the nonanalytical models presented above. Although the details of these sampling methods have been described previously (see, e.g., Rodrigue et al., 2006), the concepts outlined here form the basis of the approximation approaches described in later sections.

Parameter Expansion
Parameter expansion (PX) (Liu et al., 1998) is a computational methodology that introduces new auxiliary parameters not identified under the closed form likelihood. For instance, under the +{Gamma} model, we may introduce a rate vector, written as r = (ri)1 ≤ i ≤ N, specifying the rate of each position. The marginal and joint probabilities, on {theta} and {theta}, r, are related as follows:


Formula 6

(6)


Formula 7

(7)
where p{alpha}(r) is the unit mean gamma density.

The basic idea underlying parameter expansion is that if a sample ({theta}(h), r(h))1 ≤ h ≤ K is drawn from the joint distribution p({theta}, r | D, M), then, the {theta} component of this sample, ({theta}(h))1 ≤ h ≤ K, is distributed according to p({theta} | D, M). Therefore, to obtain a sample from p({theta} | D, M), first draw a sample from p({theta}, r | D, M), and if only the parameter vector is of interest, discard the r component. This sampling approach can be written more formally, with a MH kernel defined as


Formula 8

(8)

In practice, MH operators are typically applied separately on model parameters and auxiliary parameters; the basic sampling module in this case is referred to as the PX module, and is written symbolically as


Formula

which is to say that first, a rate vector is sampled conditional on the current parameter vector, followed by a parameter vector sampled conditional on the current rate vector. Also note that we may take the sample of auxiliary parameters seriously, for instance, by constructing the posterior distribution of a site-specific rates (Mateiu and Rannala, 2006).

Data Augmentation
The phrase data augmentation (DA) refers to another computational scheme introducing new latent (unobserved) data, somehow partially identified by the actual (observed) data (Tanner and Wong, 1987). This methodology is well entrenched in the statistical literature (van Dykand Meng, 2001), and is increasingly being exploited in phylogenetic contexts (e.g., Jensen and Pedersen, 2000; Pedersen and Jensen, 2001; Nielsen, 2002; Robinson et al., 2003; Mateiu and Rannala, 2006; Lartillot, 2006). In particular, it is central to the framework proposed by Robinson et al. (2003), for implementing SC-type models.

These sampling devices operate on the basis of specific realizations of the Markov process, also referred to as transition paths (Jensen and Pedersen, 2000; Pedersen and Jensen, 2001; Robinson et al., 2003), or mappings (Nielsen, 2002). A particular mapping, represented by {varphi}, is described by the timing and nature of each substitution event along the branches of a given tree, and establishes the augmented likelihoodp(D, {varphi} | {theta}, M), related to the actual likelihood according to


Formula 9

(9)
For now, we assume a uniform rate vector (i.e., {forall} i, ri = 1), and define the MH kernel as


Formula 10

(10)

Similarly as for the PX scheme, most implementations apply MH operators separately on model parameters and data augmentations, with the DA sampling module written symbolically as


Formula

As before, the effect of cycling over this module is a sample of parameter vectors distributed according to p({theta} | D, M), and is strictly equivalent to what we would obtain if we had explicitly computed the integral of Equation (9).

Combined Parameter Expansion and Data Augmentation
Parameter expansion and data augmentation schemes may be combined together (Liu andWu, 1999), defining the MH kernel as


Formula 11

(11)
and the PX-DA sampling module as


Formula

Both parameter expansion and data augmentation are powerful computational tools for implementing statistical models. Paradoxically, even when closed form likelihoods are available, these types of demarginalized sampling often improve the overall mixing of the MCMC (van Dyk and Meng, 2001; Gelman et al., 2004), and the techniques are often used on this basis alone (Gelman, 2004). All MH operators for PX, DA, and PX-DA sampling modules for the models studied here have been described previously (e.g., Rodrigue et al., 2006).


    Normal Approximation Methods
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
Posterior Distributions
The latent state methods described in the previous section may be viewed as brute-force approaches to approximating posterior distributions. By this we mean that whenever some density is unavailable by analytical means, we pose a new MH kernel, and revise sampling modules with new update operators, always sampling over the full posterior distribution. As an alternative, however, we may assume that the posterior is normally distributed, and attempt to estimate its mean and variance. For simplicity, we focus on estimating the mean and variance of the posterior of a particular component of the parameter vector, and assume, for now, that the rest of the parameter vector is known (i.e., for now, {theta} is treated as univariate). We begin by estimating the mean, assuming that the prior on {theta} is uniformly distributed over some broad interval, and that the mean of the posterior corresponds to the ML parameter estimate.

First, under analytical models, it is possible to apply the simulated annealing technique proposed by Kirkpatrick et al. (1983). Drawing on an analogy with thermodynamics, the method consists in heating the MCMC sampler, by introducing a parameter {tau}, mediating the temperature (1/{tau}) of the chain. With a uniform prior, the MH kernel becomes


Formula 12

(12)
As {tau} -> {infty}, the chain freezes, because an update leading to a lower likelihood has a progressively lower probability of acceptance. The algorithm is thus hoped to converge to the ML point.

An important aspect of simulated annealing is the cooling schedule, which is typically explored empirically (Nourani and Andresen, 1998). We explored two simple cooling schedules here. The first, which we refer to as proportional cooling, updates the value of {tau} at iteration n according to


Formula 13

(13)
where {delta} > 1 serves to tune the cooling scheme. In contrast, under linear cooling the restriction is {delta} > 0, updating {tau} according to


Formula 14

(14)

The simulated annealing optimization may be useful in a variety of situations, but may nevertheless be unsuitable when the likelihood is unavailable in closed form. For such situations, however, we may rely on latent state methodologies. Note that when working with a nonanalytical model, for instance relying on a DA scheme, the gradient of the log-likelihood is given by


Formula 15

(15)
where (.) stands for an expectation over the distribution of latent states. The gradient can be approximated by


Formula 16

(16)
where ({varphi}(h))1 ≤ h ≤ K is a set of sampled augmentations, drawn using the first element of the DA module. This gradient approximation can then be embedded within classical optimization methods, for example, following the gradient according to an iterative updating, with cycle n given by


Formula 17

(17)
with {delta} > 0 now acting as a predefined step parameter. Cycling between augmentation steps and gradient steps can be repeated until the gradient vanishes, thus declaring the ML estimate Formula . We refer to this algorithm as Monte Carlo gradient (MCG) optimization.

It is often also possible to apply the expectation maximization (EM) algorithm (Dempster et al., 1977) in conjunction with data augmentation schemes (Wei and Tanner, 1990), again using a sample for estimating the expectation (E-step), followed by a maximization (M-step):


Formula 18

(18)
This inner maximization step is often analytical, but can otherwise be accomplished using gradient or Newton methods (Appendix 2). We refer to this algorithm as the Monte Carlo EM (MCEM).

Once the optimal parameter value has been found, we may estimate the variance at this point as


Formula 19

(19)
The second derivative of the log-likelihood may be expressed as


Formula 20

(20)
and the Monte Carlo evaluation is given by


Formula 21

(21)

Analogous schemes for estimating the mean and variance under PX and PX-DA contexts can be devised, and can be extended for joint applications over many parameters (see developments in Appendices 1 and 2).

Bayes Factors
The Laplace method for estimating the marginal likelihood is given as (see, e.g., Tierney and Kadane, 1986)


Formula 22

(22)

where y is the dimension of the model, Formula is the parameter vector maximizing the posterior probability, and Formula is minus the inverse Hessian matrix (of second derivatives) evaluated at Formula . An important variant on (2), suggested by Raftery (1996), consist of substituting Formula with Formula and Formula with Formula (the inverse of Formula is otherwise referred to as the Fisher information matrix). This variant slightly simplifies the mathematical developments, and has the advantage of potential applicability with any maximum likelihood implementation. As such, based on the ML parameter vectors of two models (Formula 0, Formula 1), we will use the following Laplace approximation to the Bayes factor:


Formula 23

(23)

The second term in (4) can be calculated from the developments in Appendix 1. If the models are not analytical, the third term is calculated using a thermodynamic integration method, which, for a given tree configuration, computes the log-likelihood difference under two different models; these calculations are restricted versions (with fixed parameter values) of the more general thermodynamic integration methods for evaluating differences of log marginal likelihoods between pairs of models (Lartillot and Philippe, 2006; Rodrigue et al., 2006).


    Data
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
For illustrative purposes, we apply the techniques to a small data set of 20 tetrapod myoglobin sequences, with the topology fixed to the ML tree obtained using PhyML (Guindon and Gascuel, 2003) under the WAG+{Gamma} model. The contact map and solvent accessibility profile, used with SC models, are derived from a reference structure, resolved by x-ray crystallography (Protein Data Bank accession number 1MBD [PDB] ). All data and source code files are available at the Systematic Biology website (http://systematicbiology.org).


    Results and Discussion
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
MCMC-Based Optimization: An Analytical Example
Before applying the methods developed above to nonanalytical models, we first explore the properties of MCMC-based optimizations under a simpler case, where comparisons can be made with other implementations. Specifically, we apply different approaches to maximizing the log-likelihood with respect to branch lengths for a fixed tree under the WAG model.

First, under this model, the simulated annealing method can be applied. Figure 1a shows the evolution of the overall tree length during the first 100 iterations of a simulated annealing run, based on a proportional cooling schedule, with the initial {tau} = 1 increased to {tau} > 10,000 according to (3), with {delta} = 1.1. As can be seen, the chain begins with a somewhat erratic behavior, oscillating around, yet progressively gravitating towards, the tree length obtained using the PAML package (Yang, 1997). Ultimately, however, after 100 iterations, the chain slightly misses the mark.

We found the linear cooling scheme easier to adjust than proportional cooling, and less likely to become trapped in sub-optimal configurations as the chain approaches the freezing point. In Figure 1b, we started from {tau} = 1, and updated according to (4), with {delta} = 100. The chain converges to essentially identical branch length values as returned by PAML in about 35 iterations. Tuning {delta} = 500 (Fig. 1c), the ML branch lengths were obtained in about 18 iterations.


Figure 1
View larger version (83K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 Markov chain Monte Carlo maximum likelihood estimation of the tree length. In a, b, and c simulated annealing is used. In d, e, and f we use Monte Carlo gradient optimization from two starting points, and each based on a sample of 100 mappings. In g, h, and i we use Monte Carlo expectation maximization, again from two starting points, based on 10 (g), 100 (h), and 1000 (i) mappings. In each panel, a dashed line is drawn for the tree length returned by PAML (Yang, 1997).

 
We next explored the MCG algorithm, as a first latent state optimization scheme. Nielsen (2002) has proposed a straightforward DA method, which, under models like WAG, allows for a direct sampling of substitution mappings. We used Nielsen's method to draw a sample of mappings for estimating the log-likelihood gradient, as written in (6), in a MCG optimization of branch lengths. As illustrated in Figure Figure 1d, Figure 1e, and Figure 1f, a significant amount of trial-and-error tuning of the gradient optimization method can be important for reducing CPU time. In this case, expectations were estimated based on a sample of 100 mappings, and only the step parameters ({delta}) of the iterative scheme in (6) were adjusted. As crude explorations, we set the same value for each branch length step parameter throughout the run, with {delta} = 0.000001 in (Figure 1d, {delta} = 0.00001) in Figure 1e, and finally {delta} = 0.00005 in Figure if.

We also tried the MCEM algorithm in the present example. We once again relied on Nielsen's method, drawing samples of substitution mappings for the expectation estimate, followed by the maximization step given in (8). In this case, the precision of the algorithm depends solely on the sample size used to estimate the expectation, since the maximization step is analytical (Appendix 2). Using a sample of 10 mappings, significant fluctuations of the overall tree length are observed from one MCEM iteration to the next (Figure 1g). Fluctuations are reduced using 100 mappings (Fig. 1h), and become negligible (± 0.001 natural log-likelihood units) using 1000 mappings (Fig. 1i).

This corroboration across methods, as well as with the PAML package, is a useful check, and helps in getting a sense of the general behavior of the MCMC methods. In this particular case, we give preference to the MCEM algorithm, if only for the fact the tuning is exclusively based on sample size for the E-step. In fact, the sample size can be increased "online," for instance, by a factor of 10 every 10 iterations, or according to any other scheme. It should be noted, however, that the Monte Carlo error only decreases with the square root of the sample size, and that the MCEM is not necessarily the best choice for all contexts in terms of computational requirements, as we illustrate in an example below.

MCMC-Based Optimization: Nonanalytical Examples
In the preceding subsection, we applied Monte Carlo techniques for parameter optimization to a case where such methods are unnecessary. Indeed, PAML outperformed all of these MCMC optimization approaches in terms of computational time, and the methods may have little value under simpler models. In this subsection, however, we explore nonanalytical models for which standard optimization techniques are not directly possible.

Our first nonanalytical example consists of optimizing the shape parameter, {alpha}, for the +{Gamma} model, still using the WAG matrix, and, for now, with fixed branch lengths (as obtained under WAG). Starting from two different initial values, Figure 2 shows the progression of {alpha} as a function of the MCEM iterations (sampling rate vectors in the E-step). Once again, the MCEM algorithm converges quickly—within about 20 iterations—and the fluctuations of the estimate progressively decrease as the number of rate vectors sampled at each iteration increases from 10 (Fig. 2a) to 100 (Fig. 2b) to 1000 (Fig. 2c). The final value reached is Formula = 0.73. Although this estimate is not directly comparable with the discrete gamma models, we ran PAML using different numbers of categories. In general, the estimates are quite similar; using 4 categories, PAML returns Formula = 0.72, 8 categories gives Formula = 0.69, 16 categories gives Formula = 0.68, and finally, using 32 categories, the estimate is Formula = 0.70. These mild fluctuations illustrate how the number of categories used alters the gamma approximation, and while the discrete approximation may be suitable for many practical applications, PX methods for continuous distributions could have several advantages (Mateiu and Rannala, 2006), particularly when discretization procedures are in doubt (e.g., Yang et al., 2000; Susko et al., 2003; Mayrose et al., 2005) or when site-specific random variables are multivariate (e.g., Lartillot and Philippe, 2004; Pond and Muse, 2005).


Figure 2
View larger version (39K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 Monte Carlo expectation maximization algorithm for estimating Figure 2 from two starting points. The E-step of the algorithm—the Monte Carlo estimate of the expectation—is done with 10 a), 100 b) and 1000 c) draws.

 
Our next nonanalytical example concerns the SC+β model, where we wish to optimize β, still based on fixed branch lengths. We first ran an MCEM optimization using a sample of 100 mappings, and 100 sequences (for the approximation given in Equation (8)). Note that in this context, Nielsen's method serves to generate a proposed mapping (Rodrigue et al., 2005), to be accepted or rejected according to the full site-interdependent MH rule. Fig. 3a shows the first 20 iterations of the MCEM, which displays a jagged behavior in attempting to adjust the value of β so as to cancel out the two terms of the derivative of the log-likelihood function (see Equation (4)). In contrast, the MCG optimization under the same sample size conditions is much more efficient, converging with 5 iterations (Fig. 3b).


Figure 3
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 Monte Carlo estimation of β. In a, the Monte Carlo expectation maximization algorithm is used, with the Monte Carlo estimate of the expectation based on 100 draws. In b, the Monte Carlo gradient optimization is used from two starting points, each based on 100 draws.

 
In both of these examples, it is interesting to note that while we have adjusted parameters so as to maximize the log-likelihood, we have not computed the log-likelihood itself. This decoupling between log-likelihood optimization and log-likelihood calculation is a key feature of latent state methodologies, and is analogous to the property allowing us to sample from the posterior without having a closed form likelihood.

Normal Approximations of Posterior Distributions
The use of normal approximations in Bayesian analysis often serves as a first step to constructing posterior distributions under new statistical models (Gelman et al., 2004). We consider +{Gamma} and SC-type models here, and focus on their distinguishing parameters ({alpha} and β respectively).

First, under the WAG+{Gamma} model, we marginalized over branch lengths using a PX sampling module, while optimizing with respect to {alpha} (here given a uniform prior) using the MCEM algorithm. Doing so simplifies the example, in that it remains univariate, while allowing us to focus on the full posterior of {alpha}. The final MCEM iterations were based on a sample of 100 sets of branch lengths and rate vectors, as was the variance estimate (referring to Equation (9)). We used the estimates of the mean and variance for tracing a normal probability density function, and compared this trace to the density histogram obtained using the PX module sampling rates, branch lengths, and {alpha} (Figure 4). The two different density plots are reasonably similar, although the histogram appears skewed to the right, particularly when {alpha} > 1. Indeed, in this range, the shape of the gamma distribution does not undergo dramatic changes with small variations in {alpha}, which leads to a flattened out likelihood surface. This illustrates an important point: given limited data, the full posterior may differ from a normal distribution, and such approximations are only meant to give a general sense of location and diffuseness for a parameter of interest.


Figure 4
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 Posterior density plot of {alpha}, approximated using full Metropolis-Hastings sampling (histogram) and a normal approximation (dashed line).

 
Our second example concerns the SC+β model, where alternatives to constructing posterior distributions are of particular interest. Under these models, the stationary probability of the Markov process involves a normalizing constant, written as Y, which is a sum over all 20N sequences, and thus not tractable (see Appendix 1). When proposing new values β', the ratio of two such normalizing constants must be approximated in the MH kernel. Previous works investigating SC-type models have relied on an importance sampling approximation, as proposed by Robinson et al. (2003). Adapted to the present context, the approximation reads as


Formula 24

(24)


Formula 25

(25)
where (s(h))1 ≤ h ≤ K is a set of sequences sampled under β* using a Gibbs sampling approach (see Robinson et al., 2003), and where β* is chosen to be as close as possible to the middle of β and β'. We used the variation given in Rodrigue et al. (2005) for choosing β* during the MCMC, and ran a full sampling over branch lengths and β. In another run, we marginalized over branch lengths using a DA module, while optimizing β using the MCG algorithm. We relied on a sample of 100 sets of branch lengths (and mappings) and 1000 sequences (see Appendix 1, Equation (8), (5) and (6)) for the final iterations of the MCG method, and for the subsequent variance estimate. As shown in Figure 5a, the normal probability density function based on the mean and variance estimates matches well with the density histogram obtained using the full MCMC, although the normal approximation comparatively underestimates the variance to a small degree.


Figure 5
View larger version (33K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 5 Posterior density plot of β. In panel a, a histogram was generated using a full Metropolis-Hastings sampling. Panel b shows a density trace generated using thermodynamic integration(Rodrigue et al. 2006). In both panels, the normal approximation is shown (dashed line).

 
Given that the full MCMC sampling of β is based on the approximation in (4), which could breach the conditions of Markov chain convergence theorems, we performed a third run, using the thermodynamic integration method described in Rodrigue et al. (2006). This last method has the advantage of arbitrary accuracy, at the cost of CPU time, and the slight ruggedness of the posterior density trace (Figure 5b) gives a qualitative sense of the Monte Carlo fluctuations over the course of the integration. Here again, though, the posterior density of β obtained using the thermodynamic method matches well with the normal approximation, providing a reasonable corroboration across all methods. On the other hand, the normal approximation requires only a fraction of the CPU time of either of the two other methods. This may prove useful when the main interest is the posterior distribution of β or analogous parameters(Robinson et al., 2003; Rodrigue et al., 2005), particularly when approximating posteriors over several different data subsets (Yu and Thorne, 2006a).

Normal Approximation of Bayes Factors
Finally, we applied the Laplace approximation approach to estimate Bayes factors across all models mentioned herein, as well as the thermodynamic integration methods detailed in Lartillot and Philippe (2006) and Rodrigue et al. (2006). As mentioned previously, the thermodynamic method can be tuned to any desired accuracy, and we use the results under this approach as our reference values. Our crude strategy here consisted in running triplicates of each type of calculation, progressively tuning the MCMC samplers such that, when rounding to the nearest natural log unit, identical results are obtained for all three runs. We then compared accuracy and CPU time of the two methods.

For the Laplace method, we first maximized the log-likelihood with respect to branch lengths and, as applicable, {alpha} and β. For all but SC-type models we used the MCEM algorithm for the overall optimization. For SC+β-type models, however, we used a combined MCEM-MCG algorithm, which, at each iteration, performs an M-step on branch lengths (and {alpha}, if applicable) and a gradient step on β. In all cases, the final expectation estimates for optimization and for the Laplace approximation were based on samples of 10,000 substitution mappings, rate vectors (for +{Gamma} cases), and sequences (for +β cases). For a particular configuration, computing log-likelihood differences between an analytical and a nonanalytical model was done using constrained forms of the thermodynamic methods described in Lartillot and Philippe (2006) and Rodrigue et al. (2006).

The resulting Bayes factors are remarkably accurate, when compared to full thermodynamic estimates, with at most one log unit difference (Table 1). Importantly, however, the Laplace approximation required much less CPU time. The reasons for such a reduced computational time are a combination of several factors. First, as opposed to a full MCMC sampling over all admissible parameter settings, optimizations are directed toward a single optimal point. If convergence to this point is fast, far fewer likelihood function evaluations will be needed than would a full-blown sampling from the posterior. Also, the algorithms can be used with very small samples (of say 10) to obtain crude parameter estimates to be used as the starting point of more refined MCEM or MCG runs, and so on. Indeed, in our analyses, we always preceded the final iterations of MCEM or MCG with such crude estimations, which could be obtained within minutes. Next, the MCEM and MCG algorithms, and the constrained thermodynamic method considerably reduce the overall sampling, as marginalization via MCMC is focused on latent states. Lastly, the Laplace approximation for Bayes factors relies on the assumption of normality around the optimal point, making use of an estimate of the curvature of the likelihood surface; loosely speaking, the full thermodynamic method must effectively obtain this information using brute-force sampling.


View this table:
[in this window]
[in a new window]

 
Table 1. Natural logarithm of the Bayes factor for models considered, with POISSON used as a reference.

 
The model rankings obtained using Bayes factors give favor to the WAG+{Gamma} model. Note, however, that the pure SC model outperforms the POISSON+{Gamma}, although it is in turn outperformed by the pure WAG model. This ranking is reasonably encouraging for SC-type models, and additional work is needed to determine if more sophisticated statistical potentials can achieve, or surpass, the performance of the best site-independent models (Rodrigue et al., 2006; Kleinman et al., 2006).


    Future Directions
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
Methodological and Statistical Investigations
Complementing MCMC methods and normal approximations considerably reduces the needed computational resources for conducting Bayesian calculations. However, as evolutionary models increase in sophistication, and as the needed sampling schemes become more elaborate, or based on additional levels of approximation, the difficulties commonly associated with MCMC (e.g., assessments of convergence and mixing behavior) are likely to be exacerbated, and the methods should be approached with caution. First, several convergence diagnostics have been proposed (see Cowles and Carlin, 1996; Robert and Casella, 2004), although ultimately, the best assessments typically require running several instances of the chosen algorithm from different starting points, and comparing the resulting estimates (e.g., Figure 6a). Regarding mixing kinetics, in general, one can get a good sense of the behavior by choosing a key statistic for that being sampled, and displaying autocorrelation functions (e.g., Figure 6b), or other graphical analyses (Gelman et al., 2004). The subject of tuning Monte Carlo samplers in the context of MCG or MCEM algorithms is itself an active domain of research (e.g., Booth and Hobert, 1999; Levine and Casella, 2001; Fort and Moulines, 2003; Caffo et al., 2005), and transposing these ideas within the present applications may give rise to useful guidelines for greater computational efficiency. In some cases, it may even be useful to trace out the log-likelihood along a particular dimension of the model, either keeping all are other elements fixed (Figure 6c), or while marginalizing over other parameters (Rodrigue et al., 2006). Also, the choice among possible MCMC methods can be bewildering (see, e.g., Gelman et al., 2004; Robert and Casella, 2004), and it may be difficult to know beforehand which overall scheme gives sufficiently accurate estimates in reasonable compromise with computational effort. As we illustrated for optimizations of β, sampling and algorithmic choices will likely need to be explored empirically for each new context.


Figure 6
View larger version (43K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 6 Exploring sampling space and Markov chain Monte Carlo (MCMC) behavior. Panel a shows the correlation of maximum likelihood branch lengths estimated in two distinct simulated annealing runs. Panel b shows the autocorrelation function of the rate entropy when sampling rate vectors under WAG+{Gamma}. In this case, a MCMC cycle consists of one Metropolis-Hastings update to the rate of each site. Based on this plot, one would estimate that at least 10 cycles should be done in order to effectually be sampling independent rate vectors. Panel c shows a trace of the log-likelihood as a function of β (SC model), drawn using thermodynamic integration as described in Rodrigue et al. (2006).

 
The approaches employed here could also be adapted and reconfigured in several ways. For instance, here, for estimating Bayes factors, we used thermodynamic MCMC to integrate over latent states and the Laplace method to integrate over parameter space. However, if greater accuracy were needed, or if assumptions of normality no longer hold, we could also extend the thermodynamic method over any subset of parameters and apply the Laplace method over the remaining parameter(s). In addition, other approaches to the Laplace method have been proposed, several of which do not require computing derivatives. Referring to Equation (2), Formula and Formula could be approximated based on the output of a plain MCMC run, using the component-wise posterior mean or median and the posterior variance-covariance matrix; other choices are also given in Lewis and Raftery, (1997).

Also, we stress here that while common Bayesian discourses often describe MCMC methodologies as alternatives giving nonanalytical modeling flexibility(e.g., Paap, 2002; Brooks, 2003; Beaumont and Rannala, 2004), such features are not exclusive to Bayesian contexts. As we have shown here, MCMC techniques can also be used to instantiate the maximum likelihood principle, such that they may be viewed as general, and independent of any particular probabilistic paradigm. As such, other model selection approaches could be applied; for evolutionary models assuming independence between sites, comparisons could be made using information theoretic criteria (e.g., Akaike, 1974), and the method of cross-validation (Stone, 1974) should be applicable in general, as the size of data sets increases to include multiple proteins (or genes). More work is needed to clarify the properties of these and other statistical measurements in a phylogenetic context, in particular as data sets grow in size, and as the complexity of models increases (Alfaro and Huelsenbeck, 2006).

Towards Topology Comparisons under Complex Models
Demarginalized sampling methods render tractable many new models of molecular evolution. Future work exploiting these methods could include explorations into the intersections of severals recent modeling approaches (e.g., Lartillot and Philippe, 2004; Pagel and Meade, 2004; Kolaczkowski and Thornton, 2004; Spencer et al., 2005; Blanquart and Lartillot, 2006). Also note that the methods used for deriving the statistical potentials of our SC models rely on similar MCMC devices to those applied here (Kleinman et al., 2006). As larger data sets are assembled, it may be pertinent to merge the derivation of statistical potentials directly into the phylogenetic context.

In addition, embedding MCMC schemes within normal approximations, or other approaches not studied here, suggests a framework for exploring different tree topologies under complex models in a straightforward way; even if sampling techniques are constrained to a single fixed topology, one can envisage comparing the marginal (or maximum) likelihood values of a set of competing trees. In cases where a search over tree space is too costly, such deliberate comparisons of candidate trees are likely to be the most sensible strategies (Ren et al., 2005). In particular here, tree comparisons under site-interdependent SC-based models have been considered too difficult in practice (Rodrigue et al., 2005; Yu and Thorne, 2006b) but could be investigated with the methods used here. More empirical work is needed to determine how different models rank a set of candidate trees under different data set characteristics. Foremost, however, it will be important to conduct larger scale studies to determine whether the level of accuracy needed for evaluating the sometimes small log-likelihood differences between competing trees is attainable, or whether further iterations of methodological developments are needed for refining computational devices.


    Appendix 1
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
Here, we outline first and second derivatives needed for the Monte Carlo optimizations and Laplace approximations. We present the developments under the site-interdependent models (allowing for gamma-distributed rates) with the understanding that the equations can be easily factored out under models assuming independence.

For a data set D, composed of an alignment of P amino acid sequences, and given a tree topology and parameters {theta}, the demarginalized likelihood function is given as:


Formula 26

(26)
where the dependence on M has been dropped out from the notation. Each factor in (6) is detailed here.

For a specific branch j, the augmented transition probability is given as


Formula 27

(27)
where,

  • sj represents the sequence at node j (a node has the same index as the branch leading to it), and sjup is the sequence at the node ancestral to j;
  • {varphi}j represents the substitution mapping along branch j;
  • {lambda}j is the length of branch j;
  • zj stands for the total number of substitution events along branch j;
  • tjk represents the timing of substitution event k on branch j;
  • sjk–1 and sjk represent the amino acid sequence states before and after substitution event k—the states before the first and after the last substitution leading to node j are equivalent to the states at the ends of the branch, written symbolically as sj0 = sjup and sjzj = sj;
  • {sigma}jk is the site of substitution k along branch j;
  • {Upsilon}(sjk–1) = {sum}limitsi = 1N {sum}limitssi'Rsjk–1si'ri represents the rate away from state sjk–1, with the inner sum being over the 19 sequence states that differ with sjk–1 at position i.
The stationary distribution of the Markov process, appearing in (6), is given by:


Formula 28

(28)
where s0 represents the sequence at the root node (labeled as node 0), and Y is the associated normalizing constant


Formula 29

(29)
summing is over all 20N sequences.

Let y be the dimension of {theta}, and let v and w index the entries in {theta} (i.e. 1 ≤ v, w, ≤ y). The first derivative of the actual log-likelihood function is written as


Formula 30

(30)
and the second derivative as


Formula 31

(31)

Each of the expectations in (0) and (1) can be estimated based on samples drawn using the first two elements of the PX-DA module. The last term in (1) requires that we compute the second derivatives. The first derivative is a vector, written as


Formula 32

(32)
The second derivative therefore yields a matrix, where, for each entry in (2), the derivative is taken once again with respect to each parameter:


Formula 33

(33)
Observing the structure of (2) and (3), we compute the necessary derivative combinations in the following subsections.

Computing {partial}/{partial}β
The first derivative of the augmented log-likelihood with respect to β involves two types of terms:


Formula 34

(34)

The first term in (4) is given by


Formula 35

(35)


Formula 36

(36)
where (.) stands for an expectation with respect to (8). This expectation can be estimated based on a sample of sequences (s(h))1 ≤ h ≤ K, drawn from (8) using the Gibbs sampling procedure described in Robinson et al. (2003):


Formula 37

(37)


Formula 38

(38)

The second term in (34) is given as:


Formula 39

(39)
which can be calculated from


Formula 40

(40)
and


Formula 41

(41)

Computing {partial}/{partial}{lambda}j
For computing derivatives with respect to branch lengths, it is more practical to re-parametrize (7) using the following change of variables:


Formula 42

(42)
with uj = (ujk)k ≤ zj defined as


Formula 43

(43)
In equation (2), the factor Formula can be developed as


Formula 44

(44)
such that an alternative to the augmented transition probability can be written as


Formula 45

(45)
In logarithmic form, the derivative is thus given by


Formula 46

(46)
which can be evaluated based on


Formula 47

(47)
and


Formula 48

(48)

Computing {partial}/{partial}{alpha}
Computing the derivative with respect {alpha} only involves the prior on rates:


Formula 49

(49)
The derivative is thus given by


Formula 50

(50)
where {Psi}({alpha}) = {partial}/{partial} {alpha}ln {Gamma}({alpha}) is known as the digamma function, for which estimating routines are available (Galassi et al., 2003).

Computing {partial}2/{partial} β2
Computing this second derivative requires two terms. First, we have the following:


Formula 51

(51)


Formula 52

(52)

The next term involves the stationary probability:


Formula 53

(53)


Formula 54

(54)
where the expectations can again be estimated using the Gibbs sampling method of Robinson et al. (2003), giving


Formula 55

(55)
and


Formula 56

(56)

Computing {partial}2/{partial} β{partial}{lambda}j
This only requires terms already derived, as given in (1).

Computing {partial}2/{partial}{lambda}j2
Referring to (47), this second derivative requires the following term:


Formula 57

(57)


Formula 58

(58)

Computing {partial}2/{partial} {alpha}2
For gamma distributed rates, we get:


Formula 59

(59)
where {Psi}' = {partial}2/{partial}{alpha}2 ln {Gamma}({alpha}) can again be approximated using standard routines (Galassi et al., 2003).


    Appendix 2
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
In this appendix, we give details of the M-step for the MCEM algorithm.

The M-step in the case of branch lengths is an example of the ideal case, where we have an analytical solution. Specifically, at iteration n of the MCEM algorithm, each branch length is updated as


Formula 60

(60)
where


Formula 61

(61)
Writing (z(h)j)1 ≤ h ≤ K for the number of substitutions along branch j of draw h, and (u(h)jk)1 ≤ h ≤ K for the re-parameterized configuration of each mapping, the needed expectations are estimated as


Formula 62

(62)
and


Formula 63

(63)

The M-step for optimizing {alpha} and β, however, is not direct, because solving for these parameters is not possible. Nonetheless, this inner maximization can be readily done using a gradient scheme similar to that described in the main text; using the same sample, gradient steps are performed repeatedly until the maximum is reached, following which a new sample is drawn for the next MCEM cycle, and so on. In the case of {alpha}, we also tried maximization using a Newton–Raphson-like method; the M-step is accomplished through an iterative updating, with cycle m given by


Formula 64

(64)
where


Formula 65

(65)


    Acknowledgments
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 
We wish to thank David Bryant for discussions, as well as Frédéric Delsuc, Roderic Page, Paul Lewis, and two anonymous referees for comments on the manuscript. We also thank the Réseau Québécois de calcul de haute performance for computational resources. This work was supported by Génome Québec, the Canadian Institute for Advanced Research, the Canadian Research Chair Program, the Centre National de la Recherche Scientifique (through the ACI-IMPBIO Model-Phylo funding program), the 60ème commission franco-québécoise, and the Robert Cedergren Centre.


    References
 Top
 Abstract
 Models
 Latent State Methods
 Normal Approximation Methods
 Data
 Results and Discussion
 Future Directions
 Appendix 1
 Appendix 2
 Acknowledgments
 References
 

    Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control AC (1974) 19:716–723.[CrossRef]

    Alfaro M., Huelsenbeck J. Comparative performance of Bayesian and AIC-based measures of phylogenetic model uncertainty. Syst. Biol. (2006) 55:89–96.[Abstract/Free Full Text]

    Baurain D., Brinkmann H., Philippe H. Lack of resolution in the animal phylogeny: Closely spaced cladogeneses or undetected systematic errors? Mol. Biol. Evol. (2007) 24:6–9.[Abstract/Free Full Text]

    Beaumont M. A., Rannala B. The Bayesian revolution in genetics. Nat. Rev. Genet. (2004) 5:251–261.[Web of Science][Medline]

    Blanquart S., Lartillot N. A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Mol. Biol. Evol. (2006) 23:2058–2071.[Abstract/Free Full Text]

    Booth J. G., Hobert J. P. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. J. R. Stat. Soc. B (1999) 61:265–285.[CrossRef]

    Brooks S. P. Bayesian computation: a statistical revolution. Phil. Trans. R. Soc. Lond. A (2003) 361:2681–2697.[CrossRef]

    Caffo B. S., Jank W. S., Jones G. L. Ascent-based Monte Carlo EM. J. R. Stat. Soc. B (2005) 67:235–252.[CrossRef]

    Cowles M. K., Carlin B. P. Markov chain Monte Carlo convergence diagnostics: A comparative review. J. Am. Stat. Assoc. (1996) 91:883–904.[CrossRef][Web of Science]

    Dempster A. P., Laird N. M., Rubin D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B (1977) 39:1–22.

    Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Felsenstein J. Phylogenies from molecular sequences: Inference and reliablity. Ann. Rev. Genet. (1988) 22:521–565.[CrossRef][Web of Science][Medline]

    Felsenstein J. Inferring phylogenies (2004) Sauderland, Massachusetts: Sinauer Associates.

    Fort G., Moulines E. Convergence of the Monte Carlo expectation maximization for curved exponential families. Ann. Stat. (2003) 31:1220–1259.[CrossRef][Web of Science]

    Galassi M., Davies J., Theiler J., Gough B., Jungman G., Booth M., Rossi F. Gnu scientific library: Reference manual (2003) 2nd edition. Network Theory, Ltd.

    Gascuel O. Mathematics and evolution and phylogeny (2005) New York: Oxford University Press.

    Gelman A. Parameterization and Baysian modeling. J. Am. Stat. Assoc. (2004) 99:537–545.[CrossRef][Web of Science]

    Gelman A., Carlin J. B., Stern H. S., Rubin D. B. Bayesian data analysis (2004) Chapman and Hall/CRC.

    Guindon S., Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. (2003) 52:696–704.[Abstract/Free Full Text]

    Hastings W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika (1970) 57:97–109.[Abstract/Free Full Text]

    Huelsenbeck J. P., Jain S., Frost S. W. D., Pond S. L. K. A Dirichlet process model for detecting positive selection in protein-coding DNA sequences. Proc. Natl. Acad. Sci. USA (2006) 103:6263–6268.[Abstract/Free Full Text]

    Jaynes E. Probability theory (2003) Cambridge, UK: The logic of science. Cambridge University Press.

    Jeffreys H. Some tests of significance, treated by the theory of probability. Proc. Camb. Phil. Soc. (1935) 31:203–222.[CrossRef]

    Jensen J. L., A.-M. k. Pedersen. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. Appl. Prob. (2000) 32:499–517.[CrossRef]

    Jones D. T., Taylor W. R., Thornton J. M. A new approach to protein fold recognition. Nature (1992) 358:86–89.[CrossRef][Medline]

    Kass R., Raftery A. Bayes factors and model uncertainty. J. Am. Stat. Assoc. (1995) 90:773–795.[CrossRef][Web of Science]

    Kirkpatrick S., Gelatt C. D., Vecchi M. P. Optimization by simulated annealing. Science (1983) 220:671–680.[Abstract/Free Full Text]

    Kleinman C. L., Rodrigue N., Bonnard C., Philippe H., Lartillot N. A maximum likelihood framework for protein design. BMC Bioinformaics (2006) 7:326.[CrossRef]

    Kolaczkowski B., Thornton J. W. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature (2004) 431:980–984.[CrossRef][Medline]

    Lartillot N. Conjugate sampling for phylogenetic models. J. Comput. Biol. (2006) 13:1701–1722.[CrossRef][Web of Science][Medline]

    Lartillot N., Brinkmann H., Philippe H. Suppression of long branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol. Biol. (2007) 7((Suppl. 1)):S4.[CrossRef][Medline]

    Lartillot N., Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. (2004) 21:1095–1109.[Abstract/Free Full Text]

    Lartillot N., Philippe H. Computing Bayes factors using thermodynamic integration. Syst. Biol. (2006) 55:195–207.[Abstract/Free Full Text]

    Levine R. A., Casella G. Implementations of the Monte Carlo EM algorithm. J. Comput. Graph. Stat. (2001) 10:422–439.[CrossRef][Web of Science]

    Lewis S. M., Raftery A. E. Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. J. Am. Stat. Assoc. (1997) 92:648–655.[CrossRef][Web of Science]

    Liu C., Rubin D. B., Wu Y. N. Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika (1998) 85:755–770.[Abstract/Free Full Text]

    Liu J., Wu Y. N. Parameter expansion for data augmentation. J. Am. Stat. Assoc. (1999) 94:1264–1274.[CrossRef][Web of Science]

    Mateiu L., Rannala B. Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Syst. Biol. (2006) 55:259–269.[Abstract/Free Full Text]

    Mayrose I., Friedman N., Pupko T. A gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics (2005) 21((Suppl. 2)):ii151–ii158.[Abstract]

    Metropolis S., Rosenbluth A. W., Rosenbluth M. N., Teller A. H., Teller E. Equation of state calculation by fast computing machines. J. Chem. Phys. (1953) 21:1087–1092.[CrossRef]

    Miyazawa S., Jernigan R. L. Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules (1985) 18:534–552.[CrossRef][Web of Science]

    Nielsen R. Mapping mutations on phylogenies. Syst. Biol. (2002) 51:729–739.[Abstract/Free Full Text]

    Nielsen R. Statistical methods in molecular evolution (2005) New Work: Springer.

    Nourani Y., Andresen B. A comparison of simulated annealing cooling strategies. J. Phys. A: Math. Gen. (1998) 31:8373–8385.[CrossRef]

    Paap R. What are the advantages of MCMC based inference in latent variable models? Stat. Neerl. (2002) 56:2–22.[CrossRef]

    Pagel M., Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. (2004) 53:561–581.

    Pal C., Papp B., Lercher M. J. An integrated view of protein evolution. Nat. Rev. Genet. (2006) 7:337–348.[CrossRef][Web of Science][Medline]

    Parisi G., Echave J. Structural constraints and emergence of sequence patterns in protein evolution. Mol. Biol. Evol. (2001) 18:750–756.[Abstract/Free Full Text]

    Pedersen A.-M. K., Jensen J. L. A dependent rates model and MCMC based methodology for the maximum likelihood analysis of sequences with overlapping reading frames. Mol. Biol. Evol. (2001) 18:763–776.[Abstract/Free Full Text]

    Philippe H., Delsuc F., Brinkmann H., Lartillot N. Phylogenomics. Annu. Rev. Ecol. Evol. Syst (2005) 36:541–562.[CrossRef]

    Pond S. K., Muse S. V. Site-to-site variation of synonomous substitution rates. Mol. Biol. Evol. (2005) 22:2375–2385.[Abstract/Free Full Text]

    Raftery A. E. Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika (1996) 83:251–266.[Abstract/Free Full Text]

    Ren F., Tanaka H., Yang Z. An empirical examination of the utility of codon substitution models in phylogeny reconstruction. Syst. Biol. (2005) 54:808–818.[Abstract/Free Full Text]

    Robert C. P., Casella G. Monte Carlo statistical methods (2004) New York: Springer.

    Robinson D. M., Jones D. T., Kishino H., Goldman N., Thorne J. L. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol. (2003) 18:1692–1704.

    Rodrigue N., Lartillot N., Bryant D., Philippe H. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene (2005) 347:207–217.[CrossRef][Web of Science][Medline]

    Rodrigue N., Philippe H., Lartillot N. Assessing site-interdependent phylogenetic models of sequence evolution. Mol. Biol. Evol. (2006) 23:1762–1775.[Abstract/Free Full Text]

    Semple C., Steel M. Phylogenetics (2003) New York: Oxford University Press.

    Seno F., Micheletti C., Martian A. Variational approach to protein design and extraction of interaction potentials. Phys. Rev. Lett. (1998) 81:2172–2175.[CrossRef][Web of Science]

    Shakhnovich E. I., Gutin A. M. Engineering of stable and fast-folding sequences of model proteins. Proc. Natl. Acad. Sci. USA. (1993) 90:7195–7199.[Abstract/Free Full Text]

    Sippl M. J. Calculation of conformational ensembles from potentials of mean force; an approach to the knowledge-based prediction of local structure in globular proteins. J. Mol. Biol. (1990) 213:859–883.[Web of Science][Medline]

    Spencer M., Susko E., Roger A. J. Likelihood, parsimony, and heterogeneous evolution. Mol. Biol. Evol. (2005) 22:1161–1164.[Abstract/Free Full Text]

    Steel M. Should phylogenetic models be trying to "fit an elephant"? Trends Genet. (2005) 21:3007–309.

    Stone M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. B (1974) 36:111–147.

    Sun S., Bren R., Chan R., Dill K. Designing amino acid sequences to fold with good hydrophobic cores. Protein Eng. (1995) 8:1205–1213.[Abstract/Free Full Text]

    Susko E., Field C., Blouin C., Roger A. J. Estimation of rates-across-sites distributions in phylogenetic substitution models. Syst. Biol. (2003) 52:625–636.

    Tanner M. A., Wong W. H. The calculation of posterior distirbutions by data augmentation. J. Am. Stat. Assoc. (1987) 82:528–550.[CrossRef][Web of Science]

    Tierney L., Kadane J. B. Accurate approximations for posterior moments and marginal distributions. J. Am. Stat. Assoc. (1986) 81:82–86.[CrossRef][Web of Science]

    va Dyk D. A., Meng X. L. The art of data augmentation. J. Comput. Graph. Stat. (2001) 10:1–50.[CrossRef][Web of Science]

    Wei G. C. G., Tanner M. A. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J. Am. Stat. Assoc. (1990) 85:699–704.[CrossRef][Web of Science]

    Whelan S., Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. (2001) 18:691–699.[Abstract/Free Full Text]

    Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.[Abstract]

    Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. (1994) 39:306–314.[CrossRef][Web of Science][Medline]

    Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. CABIOS (1997) 13:555–556.[Medline]

    Yang Z. Computational Molecular Evolution (2006) New York: Oxford University Press.

    Yang Z., Nielsen R., Goldman N., A.-M. K. Pedersen. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.[Abstract/Free Full Text]

    Yu J., Thorne J. L. Dependence among sites in RNA evolution. Mol. Biol. Evol. (2006a) 23:1525–1537.[Abstract/Free Full Text]

    Yu J., Thorne J. L. Testing for spatial clustering of amino acid replacements within protein tertiary structure. . Mol. Evol. J. (2006b) 62:682–692.[CrossRef]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
N. Rodrigue, C. L. Kleinman, H. Philippe, and N. Lartillot
Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons
Mol. Biol. Evol., July 1, 2009; 26(7): 1663 - 1676.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Rodrigue, N.
Right arrow Articles by Lartillot, N.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Rodrigue, N.
Right arrow Articles by Lartillot, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?