Skip Navigation

Systematic Biology 2007 56(6):975-987; doi:10.1080/10635150701670569
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Huelsenbeck, J. P.
Right arrow Articles by Suchard, M. A.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Huelsenbeck, J. P.
Right arrow Articles by Suchard, M. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 Society of Systematic Biologists

A Nonparametric Method for Accommodating and Testing Across-Site Rate Variation

Edited by Thomas Buckley: Associate Editor

John P. Huelsenbeck1 and Marc A. Suchard2,3,4

1 Department of Integrative Biology, University of California Berkeley, CA, 94720, USA E-mail: johnh{at}berkeley.edu
2 Department of Biomathematics, David Geffen School of Medicine at UCLA Los Angeles, CA, 90095, USA
3 Department of Human Genetics, David Geffen School of Medicine at UCLA Los Angeles, CA, 90095, USA
4 Department of Biostatistics, UCLA School of Public Health Los Angeles, CA, 90095, USA


    Abstract
 Top
 Abstract
 Methods
 Results and Discussion
 Appendix 1
 Acknowledgments
 References
 
Substitution rates are one of the most fundamental parameters in a phylogenetic analysis and are represented in phylogenetic models as the branch lengths on a tree. Variation in substitution rates across an alignment of molecular sequences is well established and likely caused by variation in functional constraint across the genes encoded in the sequences. Rate variation across alignment sites is important to accommodate in a phylogenetic analysis; failure to account for across-site rate variation can cause biased estimates of phylogeny or other model parameters. Traditionally, rate variation across sites has been modeled by treating the rate for a site as a random variable drawn from some probability distribution (such as the gamma probability distribution) or by partitioning sites to different rate classes and estimating the rate for each class independently. We consider a different approach, related to site-specific models in which sites are partitioned to rate classes. However, instead of treating the partitioning scheme in which sites are assigned to rate classes as a fixed assumption of the analysis, we treat the rate partitioning as a random variable under a Dirichlet process prior. We find that the Dirichlet process prior model for across-site rate variation fits alignments of DNA sequence data better than commonly used models of across-site rate variation. The method appears to identify the underlying codon structure of protein-coding genes; rate partitions that were sampled by the Markov chain Monte Carlo procedure were closer to a partition in which sites are assigned to rate classes by codon position than to randomly permuted partitions but still allow for additional variability across sites.

Keywords: Across-site rate variation; Bayesian estimation; Dirichlet process prior; Markov chain Monte Carlo

Received November 15, 2006; Revised January 28, 2007; Accepted June 7, 2007


Sequence substitution rates depend upon the rate at which new alleles enter a population through mutation and the rate at which these alleles are fixed, replacing the other alleles that were previously present in the population. Importantly, the substitution rate can vary across an alignment of molecular sequences for a number of biological reasons. Functional constraints, for one, can differ on an alignment site-by-site basis, causing variation in the rate of substitution; sites that are more constrained by natural selection show fewer substitutions than sites that are less constrained or for which natural selection favors a nucleotide that is different than the one most frequently found in the population. Moreover, because the substitution rate depends upon two process—the mutation process and the fixation process—the substitution rate can vary across a sequence if the mutation rate varies. Finally, substitution rates can vary across an alignment of sequences for the simple reason that coalescence histories can vary across the sequences. Under a population genetics model such as the coalescence process with recombination, the history of coalescence can vary across sequences, with recombination events marking the boundaries between sites with different histories. If the amount of time represented by each coalescent history varies (e.g., if the time to the most recent common ancestor for each history varies), then substitution rates can vary across the sequences even if no selection is acting on the sequences and even if the mutation rate remains constant across the sequences.

For phylogenetic problems, the consensus of biologists favors variation in functional constraint across sequences as the main cause of among-site rate variation. The concern that the coalescence history can vary across the sequence is usually discounted for most phylogenetic studies. Indeed, a basic assumption of almost all phylogenetic analyses is that the same phylogeny underlies all of the sequences, so variation in rates caused by differences in coalescence histories is discounted a priori as a major cause of among-site rate variation. The assumption that the phylogenetic history, at least, is common to all of the sites in a phylogenetic analysis is probably a good one in most cases, because most phylogenetic analyses operate on different species that are separated by enough time that processes such as lineage sorting can be discounted. Variation in mutation rates, although documented in population studies (e.g., Arndt et al., 2005), is also thought to play a back-seat role to varying functional constraint as a cause of among-site variation in substitution rate. In this study, we concentrate on among-site variation in substitution rate that is caused by variation in functional constraints or, to the extent that it might occur, systematic variation in mutation rate. The types of among-site rate variation models we are concerned with affect all of the species at a particular site in the same way. We do not consider models such as the covariotide model, in which functional constraints can change over the evolutionary history for a particular site (Fitch and Markowitz, 1970; Tuffley and Steel, 1998; Galtier, 2001; Huelsenbeck, 2002). In this study, a high-rate site is always a high-rate site, remaining so over the entire history represented by the phylogeny.

Regardless of the cause of among-site variation in rates of substitution, it is critical to account for such variation in a phylogenetic analysis. Failure to do so can result in incorrect inference of phylogeny (Huelsenbeck and Hillis, 1993) as well as biased estimates of other model parameters (Wakeley, 1994). Along with the topology of the phylogeny, substitution rates are a critical component of all model-based phylogenetic analyses and appear as the branch lengths on a phylogenetic tree. Among-site rate variation is accommodated by assuming that the relative rate for the ith site, ri, varies across the sequences according to some underlying parametric model. The ultimate effect of among-site rate variation models is to multiply all of the branches of a tree for site i by the rate parameter ri. In this manner, the branch length proportions are maintained on a site-by-site basis; a high-rate site, for example, has a tree with branch lengths that are all proportionally larger than a low-rate site.

Two general approaches have been pursued to model among-site rate variation. The first approach treats the relative rate for site i as a random variable drawn from a common across-sites gamma probability distribution (Uzzell and Corbin, 1971; Nei et al., 1976; Jin and Nei, 1990; Yang, 1993), a log normal probability distribution (Olsen, 1987), an inverse Gaussian probability distribution (Waddell and Steel, 1997), or a probability distribution in which some proportion of the sites remain invariable (Hasegawa et al., 1987). Mixtures of both invariant sites and randomly distributed rates are also successful. Gu et al. (1995) provide the first example of phylogenetic analysis using a mixture of a proportion of invariable sites and gamma-distributed rates model. Waddell and Steel (1997) consider extensions of the Gu et al. (1995) approach, including inverse Gaussian distributed rates. Yang (1995) considered a hidden Markov model of among-site rate variation in which the marginal rate at a site is gamma distributed but in which adjacent sites have potentially correlated rates, accounting for the observation that some genes harbor regions of high or low substitution rate. Finally, Kosakovsky-Pond and Frost (2005) considered a general discrete distribution for among-site rate variation and considered methods for allowing the parameters of such a model to be reliably estimated. This model is perhaps the most flexible model of among-site rate variation considered to date. The second approach for modeling among-site rate variation groups sites into classes and estimates the relative rate of substitution independently for each class. For protein-coding DNA sequences, sites are commonly grouped by codon position. The relative substitution rates are then estimated for the first-, second-, and third-codon position categories. Typically, one finds that third position sites have the highest substitution rate and second position sites have the lowest substitution rate; because of the redundancy of the genetic code, third position sites are more able to vary without changing the function of the protein. Although site-specific models like the one just described are usually applied to protein-coding DNA with partitioning being first-, second-, and third-codon positions, other ways of grouping sites into classes can also be applied. The most extreme partitioning scheme is to assign each site its own rate class (see Swofford et al., 1996; Nielsen, 1997); this means that the relative rate of substitution is independently estimated for each site in the sequences. Although on first inspection, such a model seems ideal—after all, each site probably does differ from other sites, at least slightly, in its overall rate of substitution—a site-specific model in which each site has an independently estimated rate parameter has poor statistical properties (Felsenstein, 2004). One must remain vigilant of the bias-variance trade-off that adding and removing parameters from statistical models engenders; completely independent site rates is one example of model overfitting. Occam's Razor suggests that the optimal model contains the fewest parameters requisite to appropriately account for the variation in the data.

We describe an alternative model of among-site rate variation that is akin to the site-specific model with different rates at each site but employs Occam's Razor as a guiding principle to randomly construct a parsimonious description of the rate variation process. Specifically, we consider the partitioning scheme, in which sites are assigned to rate categories, to be a random variable with a prior probability distribution that is described by the Dirichlet process prior model. The Dirichlet process prior has been used with increasing frequency in a Bayesian framework for modeling cases in which the data elements are drawn from a mixture of an unknown number of probability distributions. The Dirichlet process prior model allows the number of mixture components as well as the assignment of individual data elements to mixture components to vary. In the context of rate variation across sites, the Dirichlet process prior model places non-zero probability on an equal rates model (which occurs when all of the sites are assigned to the same rate class), some probability on the extreme case in which each site is placed in a rate class by itself, as well as probability on all of the other possible partitioning schemes that assign sites to rate classes. A priori, we can analytically calculate these weights and then compare them to a posteriori estimates to select among different rate variation models in a formal statistical framework.

The Dirichlet process prior is often described in context to the "Chinese restaurant problem," a description that provides an intuitive explanation of the process using a hypothetical example of seating patrons in a restaurant. One imagines a queue of patrons waiting outside the entrance of a Chinese restaurant. The restaurant, besides serving wonderful Chinese cuisine, has a countably infinite number of tables (and each table can seat an infinite number of people). The patrons enter the restaurant one at a time. The first patron sits at some arbitrary table. Obviously this occurs with probability one. The next patron can either sit at the same table as the first, with probability 1/1+{chi}, or at an unoccupied table with remaining probability x/1+ {chi}. In general, patron i sits at occupied table k with probability {eta}k/i + {chi} or at an unoccupied table with probability {chi}/i + {chi}. ({eta}k is the number of people currently sitting at table k.) After all of the patrons have entered the restaurant, the patrons will occupy some number of tables K, and the occupancy of the tables can be noted. Importantly, the probability of the number of tables and the specific pattern of people sitting at tables can be calculated, and this probability does not depend upon the order in which patrons entered the restaurant. For the purposes of this paper, we replace "patrons" with the word alignment "site." Sites that "sit at the same table" all share a common relative substitution rate. There are other ways to describe the Dirichlet process prior (e.g., a stick-breaking description for the proportion of data elements that are assigned to each cluster). Besides being used on a limited basis in phylogenetics (Lartillot and Philippe, 2004; Huelsenbeck et al., 2006) and population genetics, where the sampling procedure underlies the Ewens' sampling formula (Ewens et al., 1998), the Dirichlet process prior has been of general use in Bayesian analysis of mixture models.


    Methods
 Top
 Abstract
 Methods
 Results and Discussion
 Appendix 1
 Acknowledgments
 References
 
Our description of the Dirichlet process prior model for among-site rate variation involves many parameters. We provide a list of most of the parameters used in this paper in Table 1. Throughout, we use f(·) or g(·) to describe a probability or probability density/mass function. The identity of the probability distribution should be clear from its arguments.


View this table:
[in this window]
[in a new window]

 
Table 1. A description of most of the parameters used in this study.

 
Data
We assume that the biologist has properly aligned DNA sequences from S species. The alignment is represented as an S x N matrix of nucleotides, Y, where N is the length of the sequences in the alignment.

The following provides an example of an alignment of S = 4 sequences, each N = 10 nucleotides in length:


Formula 1

(1)
Each column of the alignment is called a site; this example alignment has N = 10 sites, which can be represented as the column vectors: y1= (TTTT)t, y2= (TTCG)t, y3= (TTTT)t, y4= (TTTT)t, y5= (CCCC)t, etc.

Phylogenetic Model
Our phylogenetic model consists of a tree representing the genealogical relationships of the species and a model of character change that describes how the characters (nucleotides in this case) change over evolutionary time on the phylogenetic tree to produce the observed data Y at the tree's S tips. We imagine that all of the possible phylogenetic trees have been labeled, {tau}1, {tau}2, {tau}3, ..., {tau}B(S), where B(S) is the number of possible trees for S species. In this study, we use a time-reversible model of character change (which will be described in more detail below); as a consequence, we consider only unrooted phylogenetic trees. The unrooted trees that are inferred in this paper can be rooted using other criteria, such as the outgroup criterion. There are a total of B(S) = (2S–5)!! possible unrooted phylogenetic trees for S species (Schróder, 1870).

Every unrooted phylogenetic tree has 2S – 3 branches. Ideally, the lengths of the branches are represented in terms of real time; for instance, the time between speciation events on the tree. However, because it is difficult to infer the branch lengths on a phylogenetic tree in terms of time, these lengths are usually measured in terms of the number of substitutions per site that are expected to occur along each branch. We denote the collections of branch lengths as t= (t1, ..., t2S–3). The tree length is the sum of the branch lengths: T = {sum}b = 12S–3tb.

We assume that nucleotides change along the tree according to a continuous-time Markov chain. This is a typical assumption of phylogenetic analysis. A continuous-time Markov chain can be completely described by knowledge of the instantaneous rates of change between nucleotides. Here, we assume that substitutions occur according to the general time-reversible (GTR) model of DNA substitution, which has instantaneous rates:


Formula 2

(2)
and was first described by Tavaré (1986). The entry in the uth row and vth column of the matrix specifies the infinitesimal rate of change from nucleotide u to nucleotide v. The diagonal entries of Q, here shown with a dash, are specified such that each row sums to zero. The frequency of nucleotide u under stationarity of the substitution process is denoted {pi}u. The commonly used DNA substitution models all have the unusual feature that the stationary frequencies of the nucleotides are built directly into the rate matrix. The sum of the four base frequencies must, of course, equal one. We also impose the constraint that the sum of the six rate parameters equals one: {theta}AC+ {theta}AG+ {theta}AT+ {theta}CG+ {theta}CT+ {theta}GT= 1. To ensure that branch lengths on the tree are expressed in terms of expected number of substitutions per site, the substitution rate matrix must be rescaled such that the expected substitution rate is one for a unit branch length. This is achieved by setting µ = –1/{sum}u isin (A,C,G,T){pi}uquu. It also means that we cannot estimate the absolute rates of substitution, only their relative rates. The substitution rate parameters are contained in the vector {theta} = ({theta}AC, {theta}AG, {theta}AT, {theta}CG, {theta}CT, {theta}GT). Similarly, the nucleotide frequency parameters are contained in the vector {pi} = ({pi}A, {pi}C, {pi}G, {pi}T).

Likelihood
With the addition of two more assumptions—that substitutions at different sites and along different branches are independent of one another—we can calculate the likelihood. The likelihood is proportional to the probability of the observed data, Y, conditional on the unknown parameters of the model ({tau},t, {theta}, {pi}). We use the pruning algorithm described by Felsenstein (1981) to analytically calculate the likelihood f(yi| {tau}, t, {theta}, {pi}) of each site i in the alignment.

Assuming that substitutions are independent across sites, the likelihood of the complete alignment is the product of the site likelihoods:


Formula 3

(3)
With the likelihood function in hand, parameter estimation can be performed using the method of maximum likelihood or using Bayesian inference.

Across-Site Rate Variation
To this point, our description of the phylogenetic model assumed that substitution rates were equal across sites. Rate variation across sites has been introduced into phylogenetic models using two general strategies. The first method a priori partitions the sites into different categories, and then estimates the rates independently for each partition. Such a priori partitioned models are often called site-specific models. We identify these models by appending "+SS" to the substitution model name (e.g., GTR+SS). For a quintessential example of a site-specific model, imagine that the sequence alignment derives from protein-coding DNA. A common partitioning scheme for protein-coding DNA is to categorize sites by codon position. All of the first position sites are assumed to share a rate multiplier r1. Similarly, all second position sites share a rate multiplier r2and third position sites have the rate multiplier r3. To ensure that branch lengths on the tree remain expressed in terms of expected number of substitutions per site, the rate multipliers are constrained such that their mean rate is one; that is, the rates follow the constraint that (r1n1+ r2n2+ r3n3)/n = 1, where nkfor k = 1,2,3 is the number of sites assigned to partition k. The likelihood under a site-specific model is straightforward to calculate. In essence, the branch lengths are all multiplied by the rate parameter ri, where {sigma}i is the fixed mapping from site i to partition {sigma}i= 1, 2, or 3 depending on whether i is at a first-, second-, or third-codon position, respectively. The likelihood for the site i becomes f(yi| {tau}, t x r{sigma}i, {theta}, {pi}). The rate multiplier r{sigma}ifunctions as a scaling factor for all of the branch lengths: t x r{sigma}i= (t1r{sigma}i, ..., t2S–3r{sigma}i). Under the site-specific model, all branches are affected in the same way; all branches for a given site expand or contract proportionally. This proportional effect on all branches is also true for other models for among-site rate variation.

The other general strategy for accommodating rate variation across sites is to treat the rate at a given site as a random variable drawn from some common across-sites parameteric probability distribution. Two models are commonly used—the proportion of invariable sites model and the gamma model. For the proportion of invariable sites model, the rate multiplier is r = 0 with probability p and is r = 1/(1–p) with probability 1–p (Hasegawa et al., 1987). The likelihood for site i is then integrated over the two possibilities:f(yi| {tau}, t, {theta}, {pi}, p) = p x f(yi| {tau}, t x 0, {theta}, {pi}) + (1–p) x f[yi| {tau}, t/(1–p),{theta},{pi}].

For the gamma model, the rate multipliers riare assumed to be drawn independently and identically distributed (iid) from a gamma probability distribution (Yang, 1993). Each rate multiplier, then, has prior probability density


Formula 4

(4)
for ri> 0. The gamma function {Gamma}(z) = {int}0{infty}tz–1etdt. For natural numbers, {Gamma}(z) = (z–1)! The probability distribution f(· | {alpha}, β) has mean {alpha}/β and variance {alpha}2. Again, to ensure that the branch lengths remain expressed in terms of expected number of substitutions per site, the probability distribution is chosen such that the mean rate of substitution is one. This is achieved by fixing {alpha} = β. The likelihood for site i is calculated by marginalizing over all possible rate multiplier values


Formula 5

(5)
In practice, it is difficult to calculate the integral necessary to average over all rates for the gamma model. Instead, the gamma distribution is arbitrarily discretized into an a priori fixed number of categories, and the mean or median realized value for each category is used to represent all of the possible rates in that category (Yang, 1994). Hence, the integration over all possible rates simplifies to a practical summation over the rate categories. Similar to the nomenclature for the site-specific model, we specify the proportion of invariable sites and gamma among-site rate variation models by appending "+I" and "+{Gamma}," respectively, to the substitution model name. Many phylogenetic studies use a mixture of the site-specific, proportion of invariable sites, and gamma rate variation models. Using mixtures often provides a significant improvement of the fit of the phylogenetic model to the observed data.

Bayesian Inference of Phylogeny
Bayesian inference of phylogeny is based upon the posterior probability distribution of the model parameters. The posterior distribution is the probability mass function of the model parameter conditional on the data. Via Bayes theorem,


Formula 6

(6)
where f(· | Y) is the posterior probability distribution of the parameters,f(Y | ·) is the likelihood, f(·) is the prior probability distribution, and f(Y)is the marginal likelihood of the data (Mau, 1996; Li, 1996; Rannala and Yang, 1996; Mau and Newton, 1997; Yang and Rannala, 1997; Larget and Simon, 1999; Mau et al., 1999; Newton et al., 1999). We have already described how the likelihood is calculated. Bayesian analysis introduces a prior probability distribution on the parameters of the phylogenetic model. The prior probability distribution represents the biologist's beliefs about the parameter(s) before collecting the observations. The following represent commonly assummed priors on the phylogenetic parameters and is the default condition in the program MrBayes (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003):


Formula 7

(7)
That is to say, all phylogenetic trees are assumed to be equally probable a priori. Moreover, the branch lengths on each tree are assumed to be iid exponentially distributed random variables with mean 1/{lambda} (here, we set {lambda} = 10), the nucleotide frequencies are assumed to be drawn from a flat Dirichlet probability distribution, and the rate parameters of the substitution model are assumed to be random variables also drawn from a flat Dirichlet probability distribution. In the above specification, we have not included among-site rate variation model parameters, such as the proportion of invariable sites parameter P or the gamma shape parameter {alpha}. We consider these priors further after the development of an alternative parameteriation that naturally lends itself to our novel among-site rate variation model formulation.

A Dirichlet Process Model for Across-Site Rate Variation
There is an alternative parameterization of the phylogenetic model that is exactly equivalent to the scheme given above. This alternative parameterization relies on the following facts about exponential, gamma, and Dirichlet probability distributions. First, the sum of M exponential({lambda}) random variables is a gamma probability distribution with parameters {alpha} = M and β = {lambda}. To see this, the exponential distribution has probability density f(x | {lambda}) = {lambda} e{lambda} x. Summing M exponential random variables generates an analytically tractable convolution integral, leading to


Formula 8

(8)
where y = x1+ x2+ ... + xM. Second, imagine dividing each of the M exponential random variables by their gamma-distributed sum. Doing this for the mth exponential random variable with realized value xmgenerates {Phi}m= xm/y, where y is the realized sum of the M exponential random variables. Random variable {varphi}m is the proportion of the sum represented by xm. The joint probability distribution for ({varphi}1,...,{varphi}M) follows a flat Dirichlet probability distribution. Hence, an alternative parameterization of our prior model in Equation (7) is


Formula 9

(9)
where {varphi}b is the proportion of total tree length T allocated to branch b: specifically, tb = {varphi}bx T. To complete the specification, let {varphi} = ({varphi}1, ..., {varphi}2S–3).

Building upon the parameterization in Equation (9), let Tiequal the site-specific tree length for site i. A model without rate variation restricts T1= ... = TN= T. Here, we nonparametrically relax this assumption and allow the tree lengths to differ across sites according to a Dirichlet process prior model. Each site i exists in one of K rate categories, where K is a priori unknown and each rate category differs in its tree length T[k]for k isin {1,...,K}. To inform the assignment of sites to rate categories, we return to the vector of mappings {sigma} = ({sigma}1,...,{sigma}N); however, under the Dirichlet process prior, the mappings are random with {sigma}iisin {1,...,K}. For example, for N = 10 sites, one possible assignment vector is


Formula 10

(10)
This mapping vector assigns all sites in the same rate category and represents a model with equal rates across sites. For illustration, some other possible mapping vectors include


Formula 11

(11)
The final partitioning, above, shows the least parsimonious partitioning scheme, in which each site falls into a different rate class. These vectors represent a very small minority of all possible partitionings. For example, N = 10 sites yields a total of 115,975 ways to partition sites to rate classes. In this paper, we label partitions according to the restricted growth function notation of Stanton and White (1986); elements are sequentially numbered with the constraint that the index numbers for two sites are the same if they are found in the same rate category. If K is fixed, then the total number of ways to partition N sites among K categories is given by the Stirling numbers of the second kind:


Formula 12

(12)
Under the Dirichlet process prior model, K is random and can range from one to N. Therefore, the total number of ways to partition N sites among rate categories is a sum of the Stirling numbers of the second kind. This sum is called the Bell number (Bell, 1934):


Formula 13

(13)
The number of possible ways to assign sites to rate categories can become very large. For example, for sequences of length N = 1000, there are a total of ß1000{approx} 3 x 101927 possible ways to assign sites to rate categories!

The Dirichlet process prior treats both the number of rate clases (K) and the assignment of sites to rate classes ({sigma}) as random variables (Antoniak, 1974; Ferguson, 1973). With this in mind, the Dirichlet process prior model for rate variation across sites can be formally described as follows. First, K and {sigma} are both drawn a priori from the probability distribution


Formula 14

(14)
where {eta}kis the number of sites assigned to rate category k and {chi} is the "concentration" parameter for the Dirichlet process prior model. After the sites have been assigned to K rate classes, a tree length T[k]is assigned to each category k by drawing T[k]from a gamma probability distribution with shape 2S–3 and scale {lambda}.

The paramount parameter of the Dirichlet process prior model is its concentration parameter {chi}. Generally speaking, {chi} determines how clumpy the process becomes. If {chi}is small, then the Dirichlet process prior model tends to favor fewer classes. In fact, the prior probability of K categories is


Formula 15

(15)
where S1(·,·) is the absolute value of the Stirling numbers of the first kind. One obtains Equation (15) by integrating Equation (14) over all possible partitions {sigma} for K categories. The prior probability of two sites i1and i2finding themselves grouped together into the same rate class is


Formula 16

(16)
This formulation clearly shows that when {chi} is small, two sites are more likely to find themselves in the same rate class than when {chi} is large.

Markov Chain Monte Carlo
We wish to infer the parameters of the phylogenetic model in a Bayesian framework. We cannot feasibly calculate the posterior probability distribution of the parameters analytically, because doing so involves large summations and integrals that cannot be solved analytically. We resort to Markov chain Monte Carlo (MCMC) to approximate the posterior probability distribution of the phylogenetic model parameters (Metropolis et al., 1953; Hastings, 1970). The general goal is to construct a Markov chain whose state space reflects the parameter space of the phylogenetic model and has a stationary distribution that is the posterior probability distribution of interest. Samples then taken from this Markov chain at stationarity are valid, albeit dependent, samples from the posterior probability distribution of the phylogenetic parameters (Tierney, 1994). We base inference on the samples taken from the Markov chain. For example, the fraction of the time a specific phylogenetic tree is sampled estimates the marginal posterior probability of that tree.

We implement a variant of MCMC employing the Metropolis-Hastings algorithm. The general idea is to (1) propose a new state (combination of model parameters) for the Markov chain; (2) calculate the probability of accepting the proposed state as the next state of the Markov chain; and (3) accept or reject the proposed state according to the acceptance probability calculated in step 2. This procedure is repeated many times. Green (2003) describes a flexible method for constructing complex proposal mechanisms that keeps the calculation of the acceptance probability simple. Specifically, a new state X' is proposed as the next state of the Markov chain. The generation of the new state involves the generation of possibly many random numbers u from an arbitrary probability distribution gu). The new state is a deterministic function of the random numbers and the original state x of the Markov chain, x' = h x, u). The reverse move from x' to x—a move that is not actually made in computer memory—is imagined through another set of random numbers u' drawn from a possibly different probability distribution g'(u'), where x = h'(x', u'). The probability of accepting the proposed state x' as the next state of the Markov chain is


Formula 17

(17)
The last factor is called the Jacobian J of the transform to x' and u' with respect to x and u; it is introduced to balance the change of variables that may occur when proposing new states for the Markov chain. In Appendix 1, we provide details on our parameter proposals. Of particular interest are the proposals to update the number of rate categories K and site rate class assignments {sigma}. We make available source code for the DPP rate-variation model by request to interested readers.

Data Analysis
We analyzed four alignments of protein-coding DNA sequences: (1) an alignment of β-globin sequences sampled from S = 17 vertebrates (N = 432; Yang et al., 2000); (2) an alignment of cytochrome oxidase I (COI) sequences sampled from S = 15 gopher species (N = 379; Hafner et al., 1994); (3) an alignment of COI sequences sampled from S = 17 louse species (N = 379; Hafner et al., 1994); and (4) an alignment of cytochrome b sequences sampled from S = 31 mammalian species (N = 1140; Larget and Simon, 1999).

All analyses were performed under the general time-reversible (GTR) model of DNA substitution. We analyzed the data under five different rate models; we implemented an equal rates GTR, GTR+SS, and GTR+{Gamma} models for among-site rate variation, as well as a mixture model (GTR+SS+{Gamma}) and the Dirichlet process prior model (GTR+DPP). The Dirichlet process prior requires that the concentration parameter {chi} realizes some value. One approach, not pursued here, is to place a hyperprior on the concentration parameter. Often, a gamma prior is placed on {chi}. The approach we pursue is to choose {chi} such that the prior mean of the number of rate categories is E(K) = 3, E(K) = 5, E(K) = 10, and E(K) = 20. Doing this allows us to investigate the robustness of results to choice of {chi}. The specific values of {chi} used to achieve these expectations are shown in Table 2.


View this table:
[in this window]
[in a new window]

 
Table 2. Concentration parameters {chi} of the Dirichlet process prior. For four prior expected number of rate categories EK, we report the requisite {chi}.

 
We analyzed each alignment two times, running a Markov chain for a total of one million cycles for each analysis. Updates of the allocation vector were attempted about 10% of the time. Each update of the allocation vector involved scanning all sites, attempting to reassign each site to a rate class. Hence, the total number of MCMC updates was much larger than one million (e.g., there were over 100 million updates performed for each analysis of the mammalian cytochrome b alignment). We discarded samples taken during the first half of a million cycles as the burn-in for the Markov chain before it reaches stationarity.


    Results and Discussion
 Top
 Abstract
 Methods
 Results and Discussion
 Appendix 1
 Acknowledgments
 References
 
Model Choice
We compared the fit of the Dirichlet process prior model for among-site rate variation to several alternative models for accommodating rate variation. For the four data sets examined in this study, the Dirichlet process prior model fitted the data substantially better than the alternative models. Table 3 shows the estimated log marginal likelihoods for five rate-variation models (Equal, +{Gamma}, +SS, +SS+{Gamma}, and +DPP) for all four alignments. The marginal likelihood for model M{ell}is obtained by integrating the model likelihood over all possible combinations of model parameters, weighing each combination by its prior probability under model {ell}. In a Bayesian framework, model choice is guided by comparing marginal likelihoods. The marginal likelihood of a model is the likelihood of the data integrated against all possible parameter values under the prior distribution. Specifically, their ratio is the Bayes factor (BF) in favor of the model in the numerator against the model in the denominator. We estimated the marginal likelihoods using the harmonic mean estimator of Newton and Raftery (1994). This estimator is consistent, although it can be adversly affected by rare posterior samples with small likelihood values. In situations where the differences between marginal likelihoods are more modest, more efficient BF estimation techniques are available; these include the Savage-Dickey ratio for nested models (Suchard et al., 2001), reversible-jump Markov chain Monte Carlo to sample the joint parameter space for competing models (Green, 1995), and bridge sampling via thermodynamic integration (Lartillot and Philippe, 2006). We obtain the log BFs by taking the differences between log marginal likelihoods in Table 3. In this light, BFs >> 1 or << 1 still offer strong conclusions, whereas BFs {approx} 1 should be viewed with caution. After the DPP models, the next best model for all four data sets is either the SS or SS+{Gamma} model.


View this table:
[in this window]
[in a new window]

 
Table 3. The log marginal likelihoods for the different rate-variation models examined in this study. The subscript in DPPiindicates the prior mean of the number of classes for the Dirichlet process prior model for among-site rate variation.

 
Marginal Posterior Probability Distribution of Rates
To examine the nonparametric behavior of rate variation, we examined the marginal posterior probability density of the rates for each alignment site. Figure 1 shows the marginal posterior probability distribution of the tree length T for five arbitrarily chosen sites taken from the vertebrate β-globin alignment. The column vectors containing the information at sites 5, 30, 66, 156, and 341 are


Formula 18

(18)
and are variant for all sites except site 5. The sequence characters in these data columns fall in the order: human, tarsier, bushbaby, hare, rabbit, cow, sheep, pig, elephant seal, rat, mouse, hamster, marsupial, duck, chicken, Xenopus laevis, and Xenopus tropicalis. The marginal posterior probability distribution of T is often multimodal. For example, for site 30, the probability distribution of T has two or three modes under the Dirichlet process prior model. These multiple modes are real; they are repeatably obtained from MCMC analyses that start from different random combinations of parameters. The marginal distribution is unimodal for the equal-rate and site-specific-rate models; in fact, under the equal-rate model, the probability distribution for all sites is identical (as it must be under an equal-rate model). Similarly, the marginal probability distribution of T is identical for all sites of the same codon position for the site-specific model. We implemented the gamma model for among-site rate variation with four discrete rate categories. This implementation detail is reflected in the marginal probability distribution of rates under the models that assume gamma-distributed rate variation (i.e., the +{Gamma} and +SS+{Gamma} models), where there are four modes, reflecting the mean rate for each rate category.


Figure 1
View larger version (99K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 The marginal posterior probability density of the tree length T for five vertebrate β-globin sites under eight different rate variation models.

 
Similar to the gamma model, the Dirichlet process prior model for among-site rate variation also relies on discrete classes of rates. However, the assignment of sites to specific classes is now a random variable, as are the realized rate values for each class. Which rate class a site finds itself in depends not only upon the distribution of nucleotides among species for that site (e.g., the pattern of nucleotides at the site) but also upon the details of the tree and branch lengths.

Although the marginal posterior probability distribution for rates under the Dirichlet process prior model, and indeed for the other rate variation models, appears to be reliably approximated using the MCMC procedure, one should take care not to overinterpret the distributions. The marginal posterior probability distribution of rates is likely to be strongly influenced by details of the prior model and implementation. For example, the four modes in the marginal distribution of rates under the gamma model results because we implemented the gamma model with exactly four discrete categories (Yang, 1994), not for any biological reason. Similarly, the Dirichlet process prior model has a discrete number of rate classes to describe how rates vary across the sequences. The multiple modes in the marginal posterior probability distribution for rates under the Dirichlet process prior is likely caused by the fact that the model has a discrete number of rate classes; the rate class a site finds itself assigned to is a compromise between the optimal rate for that site and how well the rate for other sites can be explained by the rate class.

Codon Structure
All of the alignments examined in this study were of protein-coding DNA sequences. Common practice a priori partitions these data by codon position. Our modeling representation for codon partitioning is


Formula 19

(19)
A comparison of the marginal likelihoods (Table 3) shows that a model with rates partitioned according to codon position (the +SS model in this paper) does not explain the data as well as a model in which the partition is considered a random variable (the +DPP model of this paper). Although the codon partition does not fit as well as other partitions that were sampled during the course of the MCMC analysis, one can ask whether the codon partition is in some sense closer to the sampled partitions than, say, a random partition of the sites.

We explore the possibility that the partitions sampled under the Dirichlet process prior model are "closer" to the traditional codon partition than a random partitioning of the sites. We use a distance on partitions described by Gusfield (2002). Gusfield's distance between two partitions d({sigma}1, {sigma}2)is the minimum number of elements that need to be deleted from the partitions to make the induced partitions equal. Equivalently, the distance between partitions is the minimum number of elements that must be moved from one cluster to another to make the partitions the same. During the course of the MCMC analysis under the Dirichlet process prior model, a large number of partitions are sampled. These sampled partitions represent the current state of the Markov chain, describing how sites at that chain step are partitioned to rate classes. We will label the sampled partitions {sigma}(p)for p = 1,...,P, the total number of sampled partitions. For each sampled partition, we also create a randomly permuted version, in which the elements are randomly assigned to clusters. This permutation procedure maintains the number of rate categories for the partition, but randomly assigns sites to rate classes. For example, consider the sampled partition


Formula 20

(20)
Possible permutations of this partition include


Formula 21

(21)
We label Formula (p)as a permutation of the original sample {sigma}(p). For each of the four alignments, we calculate the statistic


Formula 22

(22)
where I(·) is the indicator function. This statistic is the fraction of the time the sampled partition is closer to the codon partition than a random permutation of the partition; for a long MCMC chain, this fraction converges to the probability that the data partitioning is closer to codon partitioning than random.

Table 4 shows the results of our analysis. The sampled partitions are closer to the codon partition than are random permutations of the sampled partitions. For example, about 97% to 99% of the sampled partitions for the alignments of COI for gophers and lice are closer to the codon partition than a random partition of the sites. This suggests that the Dirichlet process prior model we implemented appropriately captures the codon structure present in the data without a priori specification. In addition to codon partitioning, the Dirichlet process prior model also appropriately identifies the differences in rates across codon sites. Table 5 summarizes the mean of the marginal posterior probability distribution of tree length T for first-, second-, and third-codon positions. As expected, the second position sites have the lowest rate of substitution and the third position sites have the highest rate.


View this table:
[in this window]
[in a new window]

 
Table 4. The probability that the Dirichlet process prior partitionings are closer to a codon partitioning than random permutations.

 


View this table:
[in this window]
[in a new window]

 
Table 5. The marginal posterior mean of the tree length T at first, second, and third codon positions. The mean tree length for the codon positions are formatted as "(first, second, third)."

 
Nonparametric Across-Site Rate Variation
The Dirichlet process prior model for among-site rate variation is similar in many ways to the site-specific model. The main difference between the Dirichlet process prior and the site-specific models concerns how the sites are partitioned to rate classes. For the site-specific model, the biologist must determine the partitioning scheme before the analysis; the partitioning scheme that is chosen, whether it partitions sites by codon position or some other biologically relevant aspect of the data, becomes an inescapable assumption of the analysis. Under the Dirichlet process prior model, on the other hand, the partitioning scheme is considered to be a random variable. Our method is also similar to the mixture models proposed by Pagel and Meade (2004, 2005), with the main difference being the question addressed (substitution rate variation versus variation in the rate matrix of the continuous-time Markov chain model of character change) and the prior probability distribution on partitions.

Our MCMC implementation of the Dirichlet process prior model for among-site rate variation depends upon algorithm 8 of Neal (2000). The MCMC appears to perform well for the four data sets we explore. However, the potential space of partitioning schemes which the MCMC chain explores is huge; for any particular data set, our implementation may fail. Other proposal mechanisms can be implemented for this problem, including some that involve split-and-merge proposals, or "sweetened" split-and-merge proposals, in which sites in a cluster are either split into two clusters or the elements in two clusters are merged into one (Jain and Neal, 2004, 2005).

The Dirichlet process prior model for among-site rate variation explains the pattern of rate variation in the four data sets we analyzed better than any other existing model. We do not know, of course, whether this will remain true for other data sets. However, we are hopeful that the Dirichlet process prior model will continue to do well for future data sets, as the Dirichlet process prior is very flexible. Importantly, the Dirichlet process prior allows sites with similar rates to be grouped together into the same rate class. This is not true for commonly used models for among-site rate variation. For example, consider a site-specific model, with sites partitioned by codon position. All sites at the same codon position are treated the same, regardless of any additional patterns of variation at the sites (Buckley et al., 2001); an invariant third-codon site is grouped together with third-codon position sites for which all four nucleotides are observed. This problem can be reduced if one used a mixture of the site-specific and gamma rate-variation models. However, even under the SS+{Gamma} model, considerable structure remains in the position rates. The Dirichlet process prior model captures this additional variation easily.

The Dirichlet process prior model we describe can be improved in a number of ways. For example, under the current model, the tree length for a rate class is a random variable drawn from a gamma(2S–3, {lambda}) probability distribution. However, the current implementation makes it difficult to accommodate sites that have very low rates (e.g., invariant sites). It should be possible to augment tree lengths with a strictly invariant class. Here, we consider the tree length as a random variable with length equal to zero with probability P and drawn from a gamma(2S–3, {lambda}) probability distribution with probability 1–P. This would represent a mixture of the proportion of invariable sites and Dirichlet process prior models. Employing a Dirichlet process mixture model may allow a better fit to patterns of among-site rate variation (MacEachern and Müller, 1998).


    Appendix 1
 Top
 Abstract
 Methods
 Results and Discussion
 Appendix 1
 Acknowledgments
 References
 
Here we describe in detail the MCMC proposal mechanisms for parameters in the Dirichlet process prior model for among-site rate variation developed in this paper.

Tree
We use a variant of the "LOCAL" proposal mechanism of Larget and Simon (1999; also see Holder et al., 2005) to propose new trees. The proposal mechanism works as follows. First, an internal branch of the phylogenetic tree is chosen at random. Also chosen at random are two branches, incident to each end of the randomly chosen internal branch. Together, these random choices result in three contiguous branches. Figure 2 shows the area of rearrangement and uses a notation similar to that in Holder et al. (2005). The three branches that were randomly chosen are on the path between nodes a and c, and are referred to as the "backbone" of the move. These three branches represent a proportion of the total tree length Formula 1= {varphi}ai+ {varphi}ij+ {varphi}ic. The branches outside of the backbone, those not randomly chosen, have a proportion of the tree length Formula = 1 – Formula 1. For the example shown in Figure 1, which has only four tips, Formula 2= {varphi}ib+ {varphi}jd. Second, we draw new values for {varphi}1' and {varphi}2' from a beta probability distribution with parameters Formula 1{psi}1and {varphi}2{psi}1, centered at the original proportions with tunable precision {psi}1.Finally, one of the two branches that are incident to the backbone, either the branch ib or the branch jd is chosen at random and detached from the backbone. That branch is then randomly reattached to the backbone, along the modified path that now has proportion {varphi}'ac.


Figure 2
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 The area of rearrangement for the modified LOCAL proposal mechanism. The branch length proportions are denoted {varphi} and are changed by the LOCAL proposal.

 
The modified LOCAL proposal mechanism involves the generation of six random variables: (1) u1, the random choice of internal branch; (2) u2, the random choice of one of the two branches incident to one end of the randomly chosen internal branch; (3) u3, the random choice of one of the two branches incident to the other end of the randomly chosen internal branch; (4) u4, the new value for Formula , which is drawn from a beta probability distribution; (5) u5, the random choice of one of the two branches that are incident to the backbone (the three branches chosen with u1, u2, and u3); and (6) u6, the random proportion of the distance from node a to node c that the branch chosen using u5now falls. The probability/probability density of these six random variables is as follows:


Formula 23

(23)
As in Holder et al. (2005), we assume without loss of generality that branch jd is chosen to randomly move along the backbone. Moreover, we will follow the parameterization suggested by Holder et al. (2005) and measure path lengths from node a on the phylogenetic tree of Figure 2. Also for illustrative purposes, we drop dependence on the substitution process parameters temporarily. Consider the original state x = ({varphi}ai, {varphi}aj, {varphi}ac, {varphi}ib, {varphi}jd)and the proposed state x'= ({varphi}'ai, {varphi}'aj, {varphi}'ac, {varphi}'ib{varphi}', {varphi}'jd. We obtain the imagined random variables u'= (u'1, u'2, u'3, u'4, u'5, u'6) to return to the original state from the six drawn random variables u = (u1, u2, u3, u4, u5, u6), via


Formula



Formula 24

(24)

The Jacobian is the absolute value of the matrix of partial derivatives and is


Formula 25

(25)
and the acceptance probability for our modification of the LOCAL tree proposal mechanism becomes



Formula 26

(26)

The prior ratio is one, because all unrooted phylogenetic trees have equal probability and all combinations of branch length proportions have equal prior probability under a flat Dirichlet prior.

Base frequencies
We propose new base frequencies by randomly drawing from a Dirichlet({psi}2x {pi}) where {psi}2is a tunable precision parameter. The acceptance probability for a proposal of new nucleotide frequencies is


Formula 27

(27)

Again, the prior ratio is one because all combinations of nucleotide frequencies have equal probability under a flat Dirichlet prior.

Substitution rates
We use the same proposal mechanism for substitution rate proportions as we do for nucleotide frequencies but with tunable precision parameter {psi}3. All other aspects of the move are similar, including the acceptance probability.

Rate classes
We implemented a Gibbs sampling procedure with auxiliary components (Neal, 2000, algorithm 8) to update the allocation vector {sigma}. The method works as follows. First, we choose random site i and remove it from the current set of K rate classes. Assume that {sigma}i = k. If site i is in a rate class by itself, then we remove that rate class from the set of all classes, decreasing K by one. Otherwise, {eta}kdecreases by one. Let K(–i) denote the number of rate classes assigned to rate class k that remain after the removal of site i. We label the tree lengths for these classes c1, ..., cK(–i), T[1], ..., T[K(–i)]. We then construct {kappa} auxiliary rate classes, with the rate for each class freshly drawn from the prior on the tree length T. We label these auxiliary tree lengths T[K(–i)+1], ..., T[K(–i)+ {kappa}]. In the Gibbs portion of the step, we now reassign site i to new rate class k' with probability


Formula 28

(28)
where b is an easy-to-calculate normalizing constant. After this update, we discard any rate classes not associated with at least one site. We implement this Gibbs sampling component-wise across all N sites to arrive at a new K and {sigma}. For the analyses in this paper, we chose {kappa} = 5.

We also implemented a proposal mechanism that changes the tree length T[k]for each rate class k using a random rate multiplier.

The new tree length proposal T'[k]= T[k]e{psi}4(u–1/2), where u is a uniform(0,1) random variable and {psi}4is a tuning parameter.The acceptance probability for this proposal is


Formula 29

(29)
We found that the MCMC worked well when the adjustable tuning parameters of the proposal mechanisms took the following values: {psi}1= 100, {psi}2= 100, {psi}3= 100, and {psi}4= log e(1.1).


    Acknowledgments
 Top
 Abstract
 Methods
 Results and Discussion
 Appendix 1
 Acknowledgments
 References
 
J.P.H. was supported by NSF grant DEB-0445453 and NIH grant GM-069801. M.A.S. was supported by NIH grant GM-068955.


    References
 Top
 Abstract
 Methods
 Results and Discussion
 Appendix 1
 Acknowledgments
 References
 

    Antoniak C. E. Mixtures of Dirichlet processes with applications to non-parametric problems. Ann. Stat. (1974) 2:1152–1174.[CrossRef][Web of Science]

    Arndt P. F., Hwa T., Petrov D. A. Substantial regional variation in substitution rates in the human genome: Importance of GC content, gene density, and telomere-specific effects. J. Mol. Evol. (2005) 6:748–763.

    Bell E. T. Exponential numbers. Am. Math. Monthly (1934) 41:411–419.[CrossRef]

    Buckley Simon C., Chambers G. K. Exploring among-site variation models in a maximum likelihood framework using empirical data: Effects of model assumptions on estimates of topology, branch lengths, and bootstrap support. Syst. Biol. (2001) 50:67–86.[Abstract/Free Full Text]

    Ewens W. J., Tavaré S. The Ewens sampling formula. In: Encyclopedia of statistical science—Kotz S., Read C. B., Banks D. L., eds. (1998) New York: Wiley. 230–234.

    Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Felsenstein J. Inferring phylogenies (2004) Sunderland, Massachusetts: Sinauer Associates.

    Ferguson T. S. A Bayesian analysis of some nonparametric problems. Ann. Stat. (1973) 1:209–230.[CrossRef][Web of Science]

    Fitch W. M., Markowitz E. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. (1970) 4:579–593.[CrossRef][Web of Science][Medline]

    Galtier N. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. (2001) 18:866–873.[Abstract/Free Full Text]

    Green P. J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika (1995) 82:711–732.[Abstract/Free Full Text]

    Green P. J. Trans-dimensional Markov chain Monte Carlo. In: Highly structured stochastic systems—Green P. J., Hjort N. L., Richardson S., eds. (2003) Oxford, UK: Oxford University Press. 179–198.

    Gu Y., Fu Y. X., Lu W. H. Maximum likelihood estimation of the heterogeneity of substitution rates among nucleotide sites. Molecular Biology and Evolution (1995) 12:546–557.[Abstract]

    Gusfield D. Partition-distance: A problem and class of perfect graphs arising in clustering. Inform. Process. Lett. (2002) 82:159–164.[CrossRef]

    Hafner M. S., Sudman P. D., Villablanca F. X., Spradling T. A., Demastes J. W., Nadler S. A. Disparate rates of molecular evolution in cospeciating hosts and parasites. Science (1994) 265:1087–1090.[Abstract/Free Full Text]

    Hasegawa M., Kishino H., Yano T. Man's place in Hominoidea as inferred from molecular clocks of DNA. J. Mol. Evol. (1987) 26:132–147.[CrossRef][Web of Science][Medline]

    Hastings W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika (1970) 57:97–109.[Abstract/Free Full Text]

    Holder M. T., Lewis P. O., Swofford D. L., Larget B. Hastings ratio of the LOCAL proposal used in Bayesian phylogenetics. Syst. Biol. (2005) 54:961–965.[Free Full Text]

    Huelsenbeck J. P. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. (2002) 19:698–707.[Abstract/Free Full Text]

    Huelsenbeck J. P., Hillis D. M. Success of phylogenetic methods in the four-taxon case. Syst. Biol. (1993) 42:247–270.[Abstract/Free Full Text]

    Huelsenbeck J. P., Jain S., Frost S. W. D., Kosakovsky Pond S. L. A Dirichlet process model for detecting positive selection in protein-coding DNA sequences. Proc. Nat. Acad. Sci. USA (2006) 103:6263–6268.[Abstract/Free Full Text]

    Huelsenbeck J. P., Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics (2001) 17:754–755.[Abstract/Free Full Text]

    Jain S., Neal R. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Stat. (2000) 13:158–182.

    Jain S., Neal R. Splitting and merging components of a nonconjugate Dirichlet process mixture model (2005) Department of Statistics, University of Toronto. Technical Report 0507.

    Jin L., Nei M. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. (1990) 7:82–102.[Abstract]

    Kosakovsky Pond S. L., Frost S. D. W. A simple hierarchical approach to modeling distributions of substitution rates. Mol. Biol. Evol. (2005) 22:223–234.[Abstract/Free Full Text]

    Larget B., Simon D. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. (1999) 16:750–759.[Web of Science]

    Lartillot N., Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. (2004) 21:1095–1109.[Abstract/Free Full Text]

    Lartillot N., Philippe H. Computing Bayes factors using thermodynamic integration. Syst. Biol. (2006) 55:195–207.[Abstract/Free Full Text]

    Li S. Phylogenetic tree construction using Markov chain Monte carlo (1996) Columbus: Ohio State University. PhD dissertation.

    MacEachern S. N., Müller P. Estimating mixtures of Dirichlet process models. J. Comput. Graph. Stat. (1998) 7:223–238.[CrossRef][Web of Science]

    Mau B. Bayesian phylogenetic inference via Markov chain Monte Carlo methods (1996) Madison: University of Wisconsin. PhD dissertation.

    Mau B., Newton M. Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J. Comput. Graph. Stat. (1997) 6:122–131.[CrossRef][Web of Science]

    Mau B., Newton M., Larget B. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics (1999) 55:1–12.[CrossRef][Web of Science][Medline]

    Metropolis N., Rosenbluth A. W., Rosenbluth M. N., Teller A. W., Teller E. Equations of state calculations by fast computing machines. J. Chem. Phys. (1953) 21:1087–1091.[CrossRef]

    Neal R. M. Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. (2000) 9:249–265.[CrossRef][Web of Science]

    Nei M., Chakraborty R., Fuerst P. A. Infinite allele model with varying mutation rate. Proc. Nat. Acad. Sci. USA (1976) 73:4164–4168.[Abstract/Free Full Text]

    Newton M. A., Raftery A. E. Approximate Bayesian inference by the weighted likelihood bootstrap. J. R. Stat. Soc. Ser. B (1994) 56:3–48.

    Newton M., Mau B., Larget B. Markov chain Monte Carlo for the Bayesian analysis of evolutionary trees from aligned molecular sequences. In: Statistics in molecular biology—Seillier-Moseiwitch F., Speed T. P., Waterman M., eds. (1999) Hayward, California: IMS. 143–162. Monograph Series of the Institute of Mathematical Statistics.

    Nielsen R. Site-by-site estimation of the rate of evolution and the correlation of rates in mitochondrial DNA. Syst. Biol. (1997) 46:346–353.[Abstract/Free Full Text]

    Olsen G. J. Earliest phylogenetic branchings: Comparing rRNA-based evolutionary trees inferred with various techniques. Cold Spring Harbor Symp. Quant. Biol. (1987) 52:825–837.[Abstract/Free Full Text]

    Pagel M., Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Sys. Biol. (2004) 53:571–581.[CrossRef]

    Pagel M., Meade A. Mixture models in phylogenetic inference. In: Mathematics of evolution and phylogeny—Gascuel O., ed. (2005) Oxford, UK: Oxford University Press. 121–142.

    Rannala B., Yang Z. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. J. Mol. Evol. (1996) 43:304–311.[Web of Science][Medline]

    Ronquist F., Huelsenbeck J. P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.[Abstract/Free Full Text]

    Schröder E. Vier combinatorische Probleme. Z. Math. Phys. (1870) 15:361–376.

    Stanton D., White D. Constructive combinatorics (1986) New York: Springer-Verlag.

    Swofford D. L., Olsen G. J., Waddell P. J., Hillis D. M. Phylogenetic inference. In: Molecular systematics—Hillis D. M., Moritz C., Mable B. K., eds. (1996) 2nd edition. Sunderland, Massachusetts: Sinauer Associates. 407–514.

    Tavaré S. Some probabilistic and statistical problems on the analysis of DNA sequences. In: Lectures in mathematics in the life sciences (1986) volume 17:57–86.

    Tierney L. Markov chains for exploring posterior distributions. Ann. Stat. (1994) 22:1701–1762.[CrossRef][Web of Science]

    Tuffley C., Steel M. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. (1998) 147:63–91.[CrossRef][Web of Science][Medline]

    Uzzell T., Corbin K. W. Fitting discrete probability distributions ot evolutionary events. Science (1971) 172:1089–1096.[Abstract/Free Full Text]

    Waddell P. J., Steel M. A. General time reversible distances with unequal rates across sites. Mol. Phylogenet. Evol. (1997) 8:398–414.[CrossRef][Web of Science][Medline]

    Wakeley J. Substitution rate variation among sites and the estimation of transition bias. Mol. Biol. Evol. (1994) 11:436–442.[Abstract]

    Yang Z. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.[Abstract]

    Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. (1994) 39:306–314.[CrossRef][Web of Science][Medline]

    Yang Z. A space-time process model for the evolution of DNA sequences. Genetics (1995) 139:993–1005.[Abstract]

    Yang Z., Rannala B. Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte carlo method. Mol. Biol. Evol. (1997) 14:717–724.[Abstract]

    Yang Z, Nielsen R., Goldman N., Pedersen A. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
E. W. Bloomquist and M. A. Suchard
Unifying Vertical and Nonvertical Evolution: A Stochastic ARG-based Framework
Syst Biol, November 9, 2009; (2009) syp076v1.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
N. Lartillot, T. Lepage, and S. Blanquart
PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating
Bioinformatics, September 1, 2009; 25(17): 2286 - 2288.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
L. Si Quang, O. Gascuel, and N. Lartillot
Empirical profile mixture models for phylogenetic reconstruction
Bioinformatics, October 15, 2008; 24(20): 2317 - 2323.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Huelsenbeck, J. P.
Right arrow Articles by Suchard, M. A.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Huelsenbeck, J. P.
Right arrow Articles by Suchard, M. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?