Skip Navigation

Systematic Biology 2008 57(5):665-674; doi:10.1080/10635150802422274
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kim, J.
Right arrow Articles by Sanderson, M. J.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Kim, J.
Right arrow Articles by Sanderson, M. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2008 Society of Systematic Biologists

Penalized Likelihood Phylogenetic Inference: Bridging the Parsimony-Likelihood Gap

Junhyong Kim1 and Michael J. Sanderson2

1 Department of Biology and Penn Genome Frontiers Institute, University of Pennsylvania Philadelphia, Pennsylvania, 19104, USA; E-mail: junhyong{at}sas.upenn.edu
2 Department of Ecology and Evolutionary Biology, University of Arizona Tucson, Arizona 85721, USA; E-mail: sanderm{at}email.arizona.edu


    Abstract
 Top
 Abstract
 Discussion
 Acknowledgments
 References
 
The increasing diversity and heterogeneity of molecular data for phylogeny estimation has led to development of complex models and model-based estimators. Here, we propose a penalized likelihood (PL) framework in which the levels of complexity in the underlying model can be smoothly controlled. We demonstrate the PL framework for a four-taxon tree case and investigate its properties. The PL framework yields an estimator in which the majority of currently employed estimators such as the maximum-parsimony estimator, homogeneous likelihood estimator, gamma mixture likelihood estimator, etc., become special cases of a single family of PL estimators. Furthermore, using the appropriate penalty function, the complexity of the underlying models can be partitioned into separately controlled classes allowing flexible control of model complexity.

Keywords: Model selection; penalized likelihood; phylogeny estimation; semi-parametric

Received October 23, 2007; Revised February 8, 2008; Accepted July 7, 2008


"Establishing ... robustness (or disproving it) by examining a wider range of models is a daunting task, but it must be undertaken."

   Felsenstein, 1978:409

Datasets used to reconstruct phylogenies are becoming increasingly complex. Not only is the size of trees increasing, but the number and heterogeneity of loci being used to infer phylogenies is also increasing steadily (Philippe et al., 2005; Jeffroy et al., 2006; Sanderson, 2007). As sequencing technology has become faster and cheaper and genomic tools have become more readily available, datasets with hundreds to thousands of loci have been analyzed (e.g., Rokas et al., 2003; Tehler et al., 2003). The wholesale shift of empirical phylogenetics over to molecular inference (save for fossil taxa), combined with increased understanding of the dynamics of molecular evolution, have driven phylogenetic methods inexorably toward probabilistic model frameworks with increasing model complexity (Huelsenbeck et al., 2001; Felsenstein, 2004). To some degree this has relegated methods like parsimony- and distance-based tree reconstruction to contexts in which their computational speed provides an obvious advantage, as with very large datasets (Driskell et al., 2004). Nonetheless, the race between the accumulation of large biological sequence datasets with increasing complexity and the ability of modeling to take this complexity into account has continued to quicken. In this paper we develop two main themes: first, that it is possible to add biological complexity to phylogenetic inference in a controlled fashion that limits the added model complexity to desired levels; and second, that doing so can provide insights into connections between parsimony, likelihood, and Bayesian inference.

Nonparametric statistical approaches make comparatively few explicit assumptions about the probability distribution of random variables or parameters but often capture extraordinarily complex patterns in data. Nonparametric regression is a good example (Green and Silverman, 1994). However, strictly nonparametric approaches often suffer from lack of power and it is often difficult to formally assess their properties. Semi-parametric methods attempt to bridge the pros and cons of parametric and nonparametric methods. One semi-parameteric inference framework is penalized likelihood (PL; Green and Silverman, 1994; Simonoff, 1994; Eggermont and Lariccia, 2001), which offers some of the advantage (and disadvantages) of each. The general strategy in PL is to combine a parameter-rich formulation of a complex model with an ad hoc penalty function that controls the variability of the distribution (often called "regularity" or "smoothness") of these parameters, without using a very detailed specification of their distribution. Penalized likelihood has been used in regression and density estimation, and in biological applications such as image analysis and morphometrics (Glasbey and Mardia, 2001; Park 2007), and clustering SNPs (Fujisawa et al., 2004). Sanderson (2002) used this approach to model variable rates of evolution on a fixed tree in the divergence time problem. There the model allows every branch of a tree to have its own arbitrary rate, but a penalty is imposed that downweights large differences in rate between neighboring branches.

The goal of a semi-parametric approach is to imbue a parametric model with sufficient complexity to capture the biological complexity emerging in the data but to then limit this model complexity by making only minimal assumptions about distribution. Essentially this overparameterizes the problem first, then backs off of this overparameterization dramatically with some simple assumptions about smoothness (see Discussion). Previous work has laid the groundwork for the overparameterization step. Tuffley and Steel (1997) described a no-common-mechanism (ncm) model, which has a separate rate parameter for every site x branch pair in the dataset. We can begin with that model and impose a penalty on these rates that increases with the variance in rates across sites, compartmentalized into terms for each branch. Because Tuffley and Steel showed that maximum-likelihood (ML) inference in the ncm model is equivalent to maximum-parsimony (MP) inference, we can establish a continuous spectrum of estimators ranging from parsimony to likelihood. In fact, we can begin with a generalization of the ncm model (see below) and embed MP in even a larger family. Moreover, because there are formal connections between penalized likelihood and Bayesian inference that have already been described in the phylogenetic divergence time case (Thorne and Kishino, 2005), connections to Bayesian tree inference can be made as well.

Penalized Likelihood Estimator for Phylogeny
Consider a phylogenetic tree with t taxa. We will use the notation {theta}i(e) to denote the parameters of a stochastic model on branch e of the tree for the ith character. For a k-state character with the standard finite-state Markov model of evolution, {theta}i(e) might refer to the collection of parameters for the transition matrix corresponding to the branch e; that is, {theta}i(e) = Mi(e) for some k by k matrix M. We will also use Xi to denote the ith character over a t-taxon tree. For example, for binary state characters and a four-taxon tree, we might have Xi = (0,0,1,1), meaning that the first two taxa have the character state 0, whereas the third and fourth taxa have character state 1 for this ith character. We will also use the notation {Theta}i = {{theta}i(e1),{theta}i(e2) ... {theta}i(eb)} to denote the collection of branch transition models over a tree with b number of branches for theith character.

For some observed character Xi, its likelihood can be easily computed from {Theta}i and the tree topology in the standard way by summing the transition probabilities over the set of possible ancestral character states (Felsenstein, 1981). We will use the notation {varphi}i = P(Xi/{Theta}i) to denote the likelihood of the ith character. Adopting the Tuffley-Steel formulation, we define a ncm (no-common mechanisms) model as a model with a base continuous-time Markov process that allows every character to have an arbitrary length on each branch. Using the above notation, the log-likelihood for the ncm model is


Formula 1

(1)
for n characters. In Equation (1) {varphi}i is a function of a collection of matrices {Theta}i = {{theta}i(e1), c,{theta}i(eb)}, where {theta}i(ej) = exp [Qti(ej)], where Q is the Jukes-Cantor rate matrix and ti(ej) is the length of the ej branch for the ith character.

Compared to the ncm model, the standard likelihood estimator under a Markov process without rate variation assumes that the parameters for every character are identical; or equivalently, that the characters are samples from an independent identical distribution (iid sampling). To distinguish from certain among-site rate variation (ASRV) models such as the gamma mixture model that are also iid, we will call this the "standard iid" model. For the standard iid model the probability of a particular character is a function of a common set of transition matrices for all characters. For this standard iid model, the position of any character in the data matrix is exchangeable with any other position and therefore the probability of the whole dataset is a function of the frequency of each kind of terminal state pattern. (We note that many other models have this positional exchangeability.) Using the notation {phi}j = P(Xj/{Theta}) for the probability of the jth terminal state pattern, the log-likelihood for the standard iid model is


Formula 2

(2)
where fj denotes the absolute frequency of the jth terminal state pattern in the data and w is the total number of possible terminal state patterns (= kt for a k-state t-taxon tree). In Equation (2), {phi}j is a function of a single set of Markov transition matrices per branch that are in turn functions of a single set of branch lengths and a single continuous time Markov model with the Jukes-Cantor rate matrix Q for all j patterns.

The no-common-mechanisms (ncm) model can be forced to become the standard iid model if we constrain the branch parameters to be identical for each character; that is, if we force {theta}i(e) = {theta}j(e) for all i,j characters and each branch e. This motivates the idea then that the ncm model and the Jukes-Cantor iid model can be connected together in a single one-parameter family of models using a penalized log-likelihood formulation. Consider the objective function,


Formula 3

(3)
where {lambda} is a tuning parameter that controls the relative contribution of the likelihood and the penalty terms, and var[{theta} (e)] denotes the variance in site-specific parameters for the branch e, appropriately computed for the type of parameter. For example, if Formula matrix we can compute


Formula 4

(4)
where mjki (e) is the j, kth element of the transition matrix on branch e for the ith site (character). That is, the average element-wise variance per branch across sites. Therefore, the penalty term sums up each of the variances in site-specific parameters over the branches of the tree. The main utility of the penalty term is to specify a function that will have increasing positive values when the among-site variation in the parameter values increases; we discuss some alternative penalty schemes below.

Given the PL({lambda}) (penalized log-likelihood) objective function, if the parameter {lambda} is set to zero, then the resulting function is identical to the log-likelihood of the ncm model. Conversely, suppose we let {lambda} go to infinity. Then, any positive value of the penalty term will be greatly magnified and the maximum value will only exist on the subset of parameters such that var[{theta} (e)] = 0. That is, the subset of parameters where each character has identical parameters for each branch. Thus, when {lambda} goes to infinity, the maximum of the PL({lambda}) objective function is found only on the same set of parameter values as the log-likelihood function for the standard iid model.

The objective function PL({lambda}) in Equation (3) represents a one-parameter family of estimators that spans the log-likelihood function of models that goes from the ncm model to the standard Jukes-Cantor iid model. The well-known result of Tuffley and Steel (1997) established that the maximum-likelihood solution to the ncm model yields the same tree topologies as the Wagner maximum-parsimony method. Therefore, we can view the estimator that maximizes PL({lambda}) as a family of estimators that can range from the maximum-parsimony estimator to the standard iid maximum-likelihood estimator.

In the following, we present numerical experiments demonstrating some of the properties of the PL({lambda}) estimator. We explicitly carry out computations for the Jukes-Cantor (JC) model; i.e., the four-state symmetric Markov model of evolution with uniform probability of change to any other state. For this model, we only have one parameter at each branch and {theta}i(e) = pi(e), where pi(e) denotes the probability that the states at the ends of the branch are non-identical. For a four-taxon unrooted fully resolved tree, there are five branches, which we number from 1 to 5. Thus, under the JC model, we have the set of parameters {Theta}i = {pi(1), pi(2), pi(3), pi(4), pi(5)} for the ith character. For the JC version of the no-common-mechanisms model, there are five parameters for each character; thus, the log-likelihood function has 5n parameters, whereas the JC standard iid model has a total of five parameters. For these computations, we implemented the PL({lambda}) estimator and numerically optimized the objective function over the parameter space using the truncated Newton method (Nash, 2000). Even for the four-taxon binary case, the ncm model requires optimization over 5n dimensions, which can be prohibitively large for datasets where n is larger than 100. Fortunately, although the model may have different parameters for each character, the optimal value for a given character pattern must be identical since the same optimization applies to the same character pattern. Therefore, the numerical optimization can be carried out restricted to 5 x 15 = 75 dimensions for the 15 different equivalent four-state character patterns under the JC model.

Statistical Consistency
An estimator is statistically consistent if it converges to the true parameter as the amount of data goes to infinity. It is well known that the maximum-parsimony estimator is not statistically consistent for certain model conditions. Felsenstein (1978) introduced the now famous two-parameter tree in which two long branches have state change probabilities p and three short branches have probability q under the binary state model of evolution. He showed that the constraint p2 < q(1 – q) must be satisfied for the maximum-parsimony estimator to be consistent under the binary state model. Similar conditions for inconsistent estimation using maximum parsimony can be found for the four-state JC model. Figure 1 shows computation of the maximum PL({lambda}) estimator for the Felsenstein model tree for various values of p, q, and {lambda} using exact dataset probabilities as the input (i.e., effectively assuming an infinite number of sites). As expected, when {lambda} = 0, the estimator displays the same behavior as the maximum-parsimony estimate, converging to the wrong estimate for parameter values that do not satisfy the quadratic constraint condition. When {lambda} = {infty}, the estimator correctly identifies the model tree for all p and q. The estimated branch parameters for either case also follow those expected for both the maximum-parsimony method and the standard iid maximum-likelihood method (data not shown). For intermediate values of {lambda}, the estimator is statistically consistent over the p, q parameter set in regions that are different from either the maximum-parsimony estimator or the maximum-likelihood estimator. (Note, one of the input values caused numerical instabilities and is shown as a red cell in Fig. 1.)


Figure 1
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 The behavior of the PL estimator for the four-taxon tree and various values of the penalty weight parameter, {lambda}. Each panel shows an array of expected infinite datasets from different tree models used as input to the PL estimator. The tree models are arranged such that the y-axis denotes the branch lengths for the two long branches (marked by dashed branches on the inset trees) and the x-axis denotes the branch lengths of the three short branches. There are three topologies for the four-taxon tree and the returned trees are colored blue for the correct tree and yellow for the incorrect trees. The red shows a parameter condition where a numerical error occurred for the optimization. When {lambda} = 0, the wrong tree is returned for models in the Felsenstein zone; when {lambda} = infinity, all trees are consistently estimated; when {lambda} = 100, some of the trees are inconsistently estimated but different from those for {lambda} = 0.

 
Statistical Power
In practice, we are concerned with the expected accuracy of an estimator in finite samples. For the parameter conditions where the maximum-parsimony estimator is inconsistent, the estimate rapidly converges to the wrong tree as more data are added. Thus, we expect poor finite sample performance for {lambda} = 0 for those model conditions. On the other hand, Siddall et al. (1998) pointed out that there are certain tree models for which the maximum-parsimony methods show better finite sample performance due to preferential bias towards resolving branches, whereas the maximum-likelihood methods show very slow convergence to the correct estimate with more data. Thus, we would expect poor finite sample performance for {lambda} = {infty} for those model conditions. To examine the finite sample performance the PL({lambda}) estimator, we simulated 100 replicates of datasets of size 250 characters over a tree with two long adjacent branches (the so-called Farris model tree) and a tree with two long nonadjacent branches (Felsenstein model tree) and examined the estimates for various values of {lambda}. Figure 2 shows the average finite sample performance of the PL({lambda}) estimator for {lambda} = 0 to {lambda} = very large for those model conditions where each is expected to perform well or poorly. The relative performance of the spectrum of PL({lambda}) estimates shows complementary behavior for the two model trees—there is a different optimal {lambda} for different tree models.


Figure 2
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 The finite sample performance of the PL({lambda}) estimator for various values of {lambda} for two different tree models. The two tree models are schematically represented in the figure—the first model has two long branches that are adjacent to each other (marked by X) and the second model has two long branches that are separated (marked by empty squares). The 100 replicates of 250 characters were simulated over each tree and then estimated by PL({lambda}). The average probability of recovering the true tree is plotted for the two tree models. The finite sample performance shows complementary behavior for the two tree models as a function of {lambda} .

 
Choice of Tuning Parameter {lambda}
The PL({lambda}) estimator is a family of estimators that has both the maximum-parsimony estimator and the standard iid likelihood estimator as special cases. The tuning parameter {lambda} selects for particular kinds of estimators and under the penalized maximum-likelihood framework can be seen as a parameter that controls the complexity of the allowable likelihood model. That is, in the penalized maximum-likelihood framework, we start with a complex likelihood function and impose smoothness constraints on the function by the penalty term. Although the value of {lambda} can be chosen in the same manner as one might choose different models—i.e., by a priori choice based on background knowledge (Burnham and Anderson, 2002)—it would be preferable to choose it based on the input data. If the input data seem to require a more complex likelihood model, we might choose lower values of {lambda} (less smoothness constraint), whereas if the data seem to better fit a simpler model, we might want to choose higher values of {lambda} (more smoothness constraint). There are several approaches to finding an appropriate value of {lambda} from the input data including utilizing approximations based on the information content of the data, such as the Akaike information criteria (AIC) or Bayesian information criterion (BIC; Burnham and Anderson, 2002).

The usual approach is to use cross-validation to assess prediction errors associated with various values of {lambda} (Hastie et al., 2001). Suppose we fit the model with some subset of the available data, say 1/2, then use the resulting model parameter estimates to check how well the model fits the deleted data. If the model over-fits the first half of the data then it will do a poor job of predicting the unseen second half of the data. On the other hand, if the model did a poor job of fitting the first half of the data in the first place, it will also do a poor job of fitting the unseen second half of the data. Therefore, if we can control the model complexity by {lambda} and then measure these prediction errors for various values of {lambda}, we might expect that there will be an optimal value of {lambda} that will have the minimum prediction error because it neither over-fits nor under-fits the data. Cross-validation rarely yields truly optimal values of {lambda} but provides a data driven approximation.

There are several possible implementations of the cross-validation idea in this problem. The procedure we used was (1) separate the input data into two random partitions of the same size, (the training and test datasets); (2) for some value of {lambda} compute the maximum penalized likelihood estimate of the tree and unknown branch parameters for each character using the training dataset; (3) use the parameters of this estimate to compute a probability model of the data columns and then compute the probability of the test dataset as a measure of prediction error; (4) repeat the process 50 times. Figure 3 shows the results of computing prediction error for values of {lambda} ranging from 0 to 106 for four different kinds of datasets. The first dataset consists of sequences from the Drosophila hairy gene (Kim, 2001; "hairy"); the second dataset consists of an equal mixture of four different standard iid models, each generated under a different rate/branch length combination ("4-pt mix"); the third dataset consists of a standard iid model for the Felsenstein model tree, which is expected to be inconsistently estimated by maximum-parsimony ("Fels"); and the fourth dataset is a standard iid model with constant and identical rate of evolution on all branches of the tree ("const"). Some values of {lambda} for some datasets were incompletely evaluated due to numerical instabilities and those points are not included in the figure.


Figure 3
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 Cross-validation procedure for choosing {lambda} from data. To search for optimal value of {lambda}, each dataset was split into 1/2 "training" and 1/2 "testing." The PL({lambda}) estimator was used on the training data to estimate the probability of the data and the probability of the test data (prediction probability) was computed using the estimated model. The prediction probability (y-axis) is plotted for various values of {lambda} for four different datasets: an empirical dataset of the fruitfly gene hairy (open box), an equal mixture of data simulated from four different models (open triangle), data simulated under the "Felsenstein zone" conditions (cross hatch), and data simulated under an equal and constant branch length tree (open circle). Numerical instability prevented evaluation of some {lambda} values.

 
Unfortunately, although there is an indication of peaks (near {lambda} = 1000 for hairy and {lambda} = 100 for the other three datasets) suggesting particular values of optimal {lambda}, the peaks are small and the graph is inconclusive. One of the problems here may be that the dataset size in terms of numbers of taxa is too small to give clear choice. Examining the branch parameter estimates, all maximum-parsimony settings (i.e., {lambda} = 0) result in p = 0.25 (i.e., off-diagonal probabilities of branch Markov matrices) for those branches where change is indicated by the maximum-parsimony ancestral state estimates and p = 0.0 where no change is indicated by the ancestral state estimate. For the hairy dataset, the optimal {lambda} value is found at {lambda} = 1000. Under the PL({lambda} = 1000) estimator, the parameter estimates differ from the standard ML estimate (i.e., PL({lambda} = infinite)) mostly in the terminal branches. The average coefficient of variation between PL({lambda} = 1000) and PL({lambda} = infinite) estimates for the five branch parameters ranges from 0.012 (middle branch) to 1.74 for the second terminal branch (D. simulans branch). The PL({lambda} = 1000) estimator allows moderate variability across characters and the coefficient of variation of the parameters across the characters ranges from 0.005 (middle branch) to 2.59 (D. simulans). Thus, the branch-specific difference in the parameter estimates between the standard ML estimate and the PL({lambda} = 1000) estimate reflects the across-site variability that is allowed by {lambda} = 1000. Although there is a substantial difference in these parameter estimates for different the {lambda} values, and therefore putative evolutionary forces, we caution that each different value of {lambda} constitutes a different model choice. Thus, the interpretation and comparison of the parameter estimates must be carried out with all the cautions relevant to comparing different models. Last, we have also investigated other potential cross-validation criteria such as measuring congruence of tree topologies from random partition of the data; however, we were not successful in finding an intuitively satisfactory manner of finding optimal choice of the tuning parameter. The development of a better scheme for choosing {lambda} is a subject for additional work and it may require additional developments to overcome the computational problem as discussed below.

Alternative Penalty Schemes and Generalizations
In the estimator (3) introduced above, we used the variance penalty function (4). As mentioned previously, the main utility of the penalty function is to generate a positive valued function that increases with increasing diversity of parameter values in the likelihood function. Alternative penalty functions might be useful in different contexts. For example, an alternative penalty might replace var[{theta} (e)] with an entropy-like measure E[{theta} (e)] = Formula matrix and mjki (e) denotes the j,kth element of the matrix for the ith character. Because entropy provides a different measure of complexity, this might provide a different formulation to penalize "structure" in the parameters. The choice of the penalty scheme will affect the properties of numerical optimization and the biological intuition of what classes of models are considered equivalently complex. For example, under the variance penalty scheme, each variance sum Formula has the same values for "circles" around the centroid. That is, the formula Formula describes a circle around the centroid Formula thus, all parameter distribution on the same circle would be considered equivalently complex.

In the estimator (3), we also chose to connect the models between the no-common-mechanisms (ncm) to the standard iid model. There are other classes of complex models that might also be of interest. For example, the so-called Olsen model (Olsen et al., 1994) posits a unique site-specific scaling parameter. This is similar to the gamma mixture model, except instead of assuming that the site-specific rate parameters are samples of a random distribution, the Olsen model assigns a different rate parameter specific to the site. The Olsen model can be used as the starting complex model and we can introduce a penalty term for the variation of the site-specific scaling parameters such that when the variation is suppressed the resulting model is the standard iid model. Similarly, we could start with the ncm model and penalize the change in the relative branch lengths across the sites while allowing site-specific scalar multipliers to be variable, thus connecting the ncm to the Olsen model.

Stochastic mixing models such as the gamma mixture model can also be embedded in the penalized log-likelihood scheme with the appropriate penalty function. For example, suppose we start from the Olsen model with a unique site-specific rate multiplier and some continuous time model such as the general time-reversible (GTR) model (cf., Swofford et al., 1996). The popular gamma mixture model assumes that these rate multipliers are stochastically sampled from a gamma distribution rather than individually assigned. A stochastic rate mixture model can be specified within the penalized likelihood scheme by considering a set of penalty terms that measure the deviation of the distribution of site-specific rate parameters from the expected distribution, say the gamma distribution. Sufficient penalty weights will then force the likelihood function optimum to converge to the distribution of rate parameters. One way to implement such a deviation function is by using the expected moments of the desired mixture distribution and comparing it to the empirical moments of the rate parameters. To be more explicit, the commonly used {Gamma} ({alpha},1/{alpha}) class of gamma mixture models has as its first four non-central moments (1, 1/{alpha}, 2/{alpha} 2, 6/{alpha}3). So, we may consider the penalized likelihood estimator:


Formula 5

(5)
where Formula is the Olsen log-likelihood function, Formula is the vector of site-specific rates, Formula is the jth moment of the assigned rates, and µj is the expected jth moment of the mixture distribution of the rates, with the summation being over the q moments of the mixture distribution. Increasing the values of {lambda} will force the optimal distribution of the rate parameters to have the same moments as the gamma distribution accurate up to the number of moments computed.

Our main point is that the penalty function construction allows us to generate a new parametric family of estimators that contain the likelihood estimators of many of the existing models as special cases. The variance penalty suggested in (3) lumps together all kinds of complexity that might be more rationally partitioned into different classes of complexity that might be of interest from a modeling point of view. To consider such portioning of the complexity classes, we can start with a further generalization of the ncm model where for each character and for each branch we assign an arbitrarily general Markov transition matrix whose only constraint is to be a non-negative matrix whose rows sum to 1. This model is fully general in the sense that for each branch and each site, a general Markov model parameterization allows implicit incorporation of various time-inhomogeneous processes and changes in different transition types. We call this the general no-common-mechanisms model (gncm). It should be noted that in this very general form, the joint probability simplex can be completely contained within the model for a single tree; thus, no tree/model is distinguishable from any other. This can be seen by considering a star-tree where appropriate transition matrices on each edge and a suitable marginal distribution at the root can be used to give probability of one to any character pattern at the terminal taxa (E. Allman and J. Kim, unpublished work). Thus, any tree estimation requires constraining the complexity as described below.

One complexity class of interest is the model of the transition probability between different character states. For example, with DNA characters we have four states and the standard continuous time Markov model is of the form:


Formula 6

(6)
where P(t) is the 4 x 4 conditional probability matrix after t units of time and Q is a 4 x 4 instantaneous transition rate matrix whose rows sum to zero. The form of the Q matrix is often constrained to generate various models including the Jukes-Cantor model (all off-diagonal elements are identical), the Kimura two-parameter model, the six-parameter symmetric model, and so on. We can start with the gncm model where every branch and every character has a conditional probability matrix of different form (and not necessarily members of a continuous time model). For each such matrix, we may penalize the variability of the matrix parameters. In the most straightforward case, we may use the variance of the parameters within each conditional probability matrix. We may, of course, also control variability of the difference in such transition matrices across the sites—as opposed to just the variability of the scalar rate parameters.

The second class of complexity class involves branch-specific variation. Sanderson (2002) has already introduced a penalized likelihood framework for adjacent branch length/rate variation. The gncm model allows any branch length/rate for any character but it may be reasonable to also introduce constraints on branch length/rate variability by a suitable penalty function as studied in Sanderson (2002).

Putting all of these potential models in a single parametric family would suggest a three-parameter estimator of the form:


Formula 7

(7)
where log L({Theta}) denotes the log-likelihood of a base gncm model that allows site-specific parameters, full parameter set for the transition probability matrix (or the rate matrix), and branch-specific parameters. The various variance terms partition the parameter variability into different complexity classes based on whether the parameter variability relates to among-site variability, among-transition states variability, or among-branch variability. The partitioning into these different classes gives us separate parameters (the {lambda}i 's) to control the particular class of complexity independent of the other classes.

Connections to Bayesian Inference
Up to now we have described the penalized likelihood framework as a scheme to embed different model classes in a single continuously parameterized estimator family. There is an alternative probabilistic interpretation of the PL({lambda}) estimator that may provide a more natural setting and perhaps more efficient numerical solutions. Exponentiation of the penalized log-likelihood yields


Formula 8

(8)
where var({Theta}) denotes the various variance components penalizing the variability of the parameter set. Equation (8) denotes the probability of the data (likelihood) multiplied by a negative exponential function that is a decreasing function of the penalty variance. The negative exponential term can be interpreted as a prior distribution from the exponential family on the parameter set seen as random variables, with the appropriate normalization of the distribution. In fact, if we use the variance as the penalty in Equation (8), an alternative formulation is to expand the sum of squares around expected constants. Then the prior distribution could be rewritten as a multi-dimensional normal distribution. (We note that this is a different kind of prior than used in the currently commonly available Bayesian estimators.) The estimate that maximizes PL({lambda}) can then be seen as a Bayesian maximum a posteriori (MAP) estimate of the parameter using the exponential prior family. From this Bayesian framework, we have an interesting view of the PL({lambda}) estimator. Under the gncm model, or when {lambda} = 0, the space of parameters has (countably) infinite dimension since we need a unique set for each new character sampled. The exponential prior family specifies a probability distribution over this infinite dimensional parameter space such that as {lambda} goes to infinity, the probability becomes concentrated on a finite dimensional subspace that corresponds to the finite number of parameters used in the standard iid model. It should be noted that other works have previously connected different classes of model classes by prior distributions such as the collection of time-reversible models suggested by Huelsenbeck et al. (2004) and the amino acid model classes and Dirichlet process priors proposed in Lartillot and Philippe (2004). These models also "connect" different model classes moving between models of different dimensions. In particular, the prior distributions used in these studies allow a non-zero prior probability on models of different dimensions. Our prior distribution has a more general structure applying a probability distribution on the general infinite-dimensional parameter space. However, it does have the problem of inducing zero probability for any particular finite dimensional model except for the convergence set at {lambda} = infinity. On the other hand, if an argument was to be made that a certain degree of complexity in the data should be reflected in the models, then it seems reasonable that we should consider a probabilistically "positive" volume of complex models rather than a particular low-dimensional subset. As a final comment, it should be noted that by framing the structure of the penalized maximum-likelihood estimator in terms of Bayesian priors we do not mean to imply that it is in fact a Bayesian estimator in the classic sense. In the Discussion, we elaborate on the conceptual differences of different estimators.


    Discussion
 Top
 Abstract
 Discussion
 Acknowledgments
 References
 
Penalized Likelihood in Tree Inference
By combining a nonparametric model of molecular evolution with a penalty on the variance in rates across sites, we have been able to study a very diverse family of estimators that spans the range from maximum-parsimony to the standard iid maximum-likelihood estimator. The known statistical properties of MP and ML(iid) are reproduced using PL with the appropriately chosen extreme values of its smoothing penalty (0 or {infty}). For example, the better finite sample performance of MP for some tree models and the guaranteed consistency of ML(iid) are found to hold for PL at these extremes. By choosing the appropriate penalty the PL estimator can accommodate variation that might take on many distributional forms, from gamma distributed to arbitrary mixtures of different distributions. The fact that the "Felsenstein zone" of inconsistency for PL varies with the smoothing parameter implies that the range of models "dialed in" by the tuning parameter will map to different ranges of models than do previously studied methods. From a model selection point of view, the penalized log-likelihood framework embeds a very large family of seemingly heterogeneous models into a single family such that the model appropriateness can be tested on a single or a small number of continuous tuning parameters ({lambda} s) rather than in discrete steps.

Since Felsenstein (1978) launched the statistical study of phylogenetic methods, numerous subsequent studies have added some caveats. This course of developments would seem to be in keeping with his comments in Felsenstein (1978) on the issue of statistical robustness and his call to explore as broad a spectrum of evolutionary models as possible. It is now clear from specific cases that departures from an assumed model can cause inconsistency of both ML and Bayesian inference (Kim, 2000; Kolaczkowski and Thornton, 2004; Mossel and Vigoda, 2005). Thus, any inference method might outperform others. However, robustness is difficult to characterize, because even in the four-taxon case, the number of possible models gets very large as models head in the direction of gncm. As some of the controversy over the simulation studies of Kolaczkowski and Thornton (2004) attest, it is unclear how to erect good performance evaluations on subsets of this vast parameter space. The PL framework, by embedding various different complexity classes into a single family, might allow us to find estimators that perform robustly over different segments of this space.

Among-Site Rate Variation and Model Complexity
One place in phylogenetic inference where data are rapidly driving increases in model complexity is in patterns of among-site rate variation (ASRV). That different rates of amino acid replacement across proteins should force adjustments to simple models used in molecular evolution and phylogenetics has been understood a long time (Uzell and Corbin, 1971), as have the rate differences expected between synonymous and nonsynonymous nucleotide substitutions (Li, 1997). The explicit inclusion of some form of ASRV often has significant impacts on phylogenetic inference in real data (e.g., Sanderson et al., 2001; Burleigh and Mathews, 2007). However, increasing numbers of multi-locus datasets and the difficulty of accommodating differing synonymous versus nonsynonymous rates with one-parameter gamma ASRV models prompted much development of highly partitioned data analysis, in which different genes, gene regions, or arbitrary subsets could have entirely different models (Nylander et al., 2004). The popularity of this approach was ensured by its implementation in MrBayes (Huelsenbeck and Ronquist, 2001), the widely used package for Bayesian inference.

The incorporation of ASRV in a complex mixture form does not exhaust the model complexity classes. In addition to genome-wide lineage effects, in which certain taxa appear to have undergone wholesale accelerations or decelerations (e.g., Kawahara and Imanishi, 2007), synonymous rates are correlated with GC content, recombination rate, exon density, and proximity to telomeres (Clark et al., 2003; Singh et al., 2005; Arndt et al., 2005). Selective sweeps, in which strong selection on a small region of the genome induces rapid change in closely linked loci, may have left its mark on 10% of the human genome (Williamson et al., 2007), inducing much rate variation. Finally, ASRV may be combined with rate variation across the tree in complex covarion type models (Ane et al., 2005) or models in which ASRV patterns evolve homoplastically across the tree (Susko et al., 2004) or show differences in their substitution rate matrices in different parts of the tree (Pagel and Meade, 2004). In full complexity, the evolution of a character may vary in its transitional tendency (i.e., the substitution matrix) and probability of event occurrence both in time and across sites. Furthermore, these variability patterns may have high-order dependency as when particular combinations of amino acids interact to generate protein structure (Rodrigue et al., 2006) or when long-term environmental changes induce large timescale dependencies (see also Thorne, 2007, for review).

The increasing availability of genome-scale data seems to call for increasing complexity of the models or at least models that explicitly try to handle heterogeneity of the datasets. However, as we move towards more complex classes of models, we can be left with a bewildering array of choices and perhaps not a clear reason why a particular finite dimensional model might be better fit or more realistic than another. Our approach, rather than explicitly adding parameters to increase complexity, uniformly replaces the various kinds of complexity by a single notion of variance or variability. On one hand, this loses a certain degree of interpretability of the models; yet on the other hand, we gain both generality and a unified framework for modeling complexity.

We note that in the above sections we have used the phrase "model complexity" without being too precise, and we stated that the PL framework allows us to start with a model of high complexity and "limit its complexity" by the tuning parameter, {lambda}. Under standard Markov models, model complexity is often measured by the number of parameters or, (almost) equivalently, the dimensions of the parameter set. The dimensions of the parameter set can be reduced through integrating out some subset of parameters under standard Bayesian frameworks (e.g., Goloboff, 2003). Under the PL framework, the model starts from an infinite-dimensional parameter set and then converges in prior probability to a lower-dimensional parameter space. If we use multiple tuning parameters, we can induce convergence to different dimensions dependent on the marginal values of the tuning parameters. However, except at the probabilistic limit, the prior measure on the model is over very high (or infinite) dimensions and thus it is difficult to discuss "complexity," especially vis-à-vis standard model complexity. Therefore, our use of the term complexity is inexact and descriptive.

Computational Barriers to Penalized Maximum-Likelihood Estimate
In the general formulation, the PL approach requires first allowing for a separate parameter set for each character (site) and then optimizing with respect to the inclusion of the penalty term. If we have t-taxon trees and n total characters, we will have O(tn) number of parameters to optimize (excluding the consideration of the number of parameters per branch, p, which is typically fixed and small). Even for a moderate sized problem, say 30 taxa and 1000 sites, this would result in a O(30,000)-dimensional optimization problem. This is a prohibitory number of variables to optimize with standard numerical algorithms requiring, in the best case, O({nu}2) function evaluations for {nu} variables. Certain symmetries of the model can reduce the optimization problem as noted in our Jukes-Cantor model where the problem reduced to a 15-dimensional set. However, as the number of taxa increases, such reduction becomes more difficult and also in the more complex models we may not want to assume exchangeability of sites (characters) of the data matrix. On the slightly positive side, the problem scales as a product of the numbers of taxa and numbers of characters, not exponentially. Given the heuristic limits of tree estimation, it may be that if we can solve for t = 100 and n = 10,000, the PL approach might be seen to be practical.

The Bayesian estimator framework outlined in equation (8) suggests that optimization of PL({lambda}) estimator might be approached using the expectation maximization (EM) algorithm, which is often used to find the MAP estimate of a Bayesian estimator (Gelman et al., 2003). More recent developments of generalized EM, e.g., variational EM and mean field approximations, have been used to compute optimizations for tens of thousands of variables (Platt et al., 2008) and the general approach has been suggested to scale to millions of variables (Nguyen and Jordan, 2008). We also note that similar to approximations used in other complex models, a priori clustering of data into blocks (e.g., by coding positions, by genes, or by distributional patterns) has the potential to reduce the computational problem by several orders. Thus, we believe practical computational approaches may exist for PL optimizations for problems up to 100 taxa and 10,000 characters.

Parsimony versus Model-Based Inference: A False Dichotomy?
Previously (Sanderson and Kim, 2000) we raised a number of questions about the wholesale shift away from parsimony-based inference in phylogenetics and toward computationally expensive and inferentially complex model-based methods. Since then several things have happened. Although parsimony methods have continued to provide useful results especially in very large datasets, their overall use has also continued to decline. This may be attributable to the progress made in improving running times in ML (e.g., Zwickl, 2002; Stamatakis, 2006); to the development of strategies for model selection (Sullivan and Joyce, 2005); and to addressing new issues that beset Bayesian inference, which we did not even consider in our original paper, such as MCMC convergence and apparent overcredibility of posterior probabilities (Lewis et al., 2005; Yang and Rannala, 2005). An important critique of parsimony as a "nonparametric method" is that one expects a good nonparametric statistic to be robust to model variation (e.g., the signed-rank test) and parsimony does not qualify or at least has not been shown to qualify in this regard (Spencer et al., 2005). However, it is clear that the situation is not so black-and-white and there are many cases in which one might employ a computationally more efficient method as suggested in Sanderson and Kim (2000) or wish to tradeoff variance with the bias of the parsimony estimate for better finite sample performance.

However, stepping back from such details, all of the recent literature shows that the argument is not whether certain classes of methods always outperform other classes of methods but rather when this happens. Within the high-dimensional space of model parameters (say the gncm space), it is well established that MP fails to be consistent in part of this space, and ML and certain Bayesian methods fail to be consistent in other parts of this space (generally in cases in which the assumed model does not closely correspond to the model in that region of the space). The PL framework does not provide a universal solution to these important issues of consistency, efficiency, identifiability, etc. However, it does allow the study of many different flavors of models or methods in a mathematically convenient unified framework. In this paper, we were originally motivated to show that various different approaches can be shown to have structural connections. Thus, previous conceptual dichotomizations, such as the contentious contrast between maximum-parsimony and maximum-likelihood, are, to a certain extent, artificial in a modeling sense. Moreover, in terms of the statistical properties of the estimators, there is a continuity. For example, maximum-parsimony fails to be consistent under certain conditions, whereas it is efficient in other conditions and generally robust to mixtures. The standard maximum-likelihood method has complementary properties. But, as we show above, other estimators can be constructed from the appropriate choice of the penalty tuning parameter, {lambda}, that have different sets of properties.

We add the caveat that although we draw structural connections between different methods, we do not imply here that the methods are identical in a conceptual sense. For example, Bayesian methods have a conceptual framework that is not captured by pointing to Equation (8) and suggesting that they are a variation of the penalized likelihood (or vice versa). In particular, obtaining a single MAP estimate is counter to the Bayesian concept of estimating distributions rather than single points. For example, recently, Huelsenbeck et al. (2008) constructed a Bayesian estimator from Tuffley and Steel's ncm model integrating over the branch lengths, which follows the classic Bayesian framework and is distinct from the MAP estimate studied in this paper. Of course, maximum-parsimony has a conceptual framework that is separate from our characterization of it as an infinite-dimensional likelihood estimator (Sober, 1988). Future work will inevitably involve mapping the tradeoffs between the characteristics of different methods/models for particular datasets regardless of the underlying conceptual differences. And, the penalized likelihood framework has the advantage of allowing systematic investigation across a broad spectrum of methods and model assumptions.


    Acknowledgments
 Top
 Abstract
 Discussion
 Acknowledgments
 References
 
This work was funded in part by NSF grant DEB 0334866 and DEB 0715370 to J.K. and DEB 0733365 to M.S. We thank Mark Holder, Laura Kubatko, and an anonymous reviewer for greatly improving this paper.


    References
 Top
 Abstract
 Discussion
 Acknowledgments
 References
 

    Ane C., Burleigh J. G., McMahon M. M., Sanderson M. J. Covarion structure in plastid genome evolution: A new statistical test. Mol. Biol. Evol. (2005) 22:914–924.[Abstract/Free Full Text]

    Arndt P. F., Hwa T., Petrov D. A. Substantial regional variation in substitution rates in the human genome: Importance of GC content, gene density, and telomere-specific effects. J. Mol. Evol. (2005) 60:748–763.[CrossRef][Web of Science][Medline]

    Burleigh J. G., Mathews S. Assessing among-locus variation in the inference of seed plant phylogeny. Int. J. Plant Sci. (2007) 168:111–124.[CrossRef][Web of Science]

    Burnham K. P., Anderson D. R. Model selection and multi-model inference (2002) 2nd. New York: Springer.

    Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Fujisawa H., Eguchi S., Ushijima M., Miyata S., Miki Y., Muto T., Matsuura M. Genotyping of single nucleotide polymorphism using model-based clustering. Bioinformatics (2004) 20:718–U489.[Abstract/Free Full Text]

    Gadagkar S. R., Kumar S. Maximum-likelihood outperforms maximum-parsimony even when evolutionary rates are heterotachous. Mol. Biol. Evol. (2005) 22:2139–2141.[Abstract/Free Full Text]

    Gaucher E. A., Miyamoto M. M. A call for likelihood phylogenetics even when the process of sequence evolution is heterogeneous. Mol. Phyl. Evol. (2005) 37:928–931.[CrossRef][Web of Science][Medline]

    Gelman A., Carlin J. B., Stern H.S., Rubin D.B. Bayesian data analysis (2003) 2nd. New York: Chapman & Hall/CRC.

    Glasbey C. A., Mardia K. V. A penalized likelihood approach to image warping. J. R. Stat. Soc. B (2001) 63:465–492.[CrossRef]

    Goloboff P. A. Parsimony, likelihood, and simplicity. Cladistics (2003) 19:91–103.[CrossRef][Web of Science]

    Hastie T., Tibshirani R., Friedman J. H. The elements of statistical learning: Data mining, inference, and prediction (2001) New York: Springer.

    Huelsenbeck J. P., Ané C., Larget B., Ronquist F. A Bayesian perspective on a non-parsimonious parsimony model. Syst. Biol. (2008) 57:406–419.[Abstract/Free Full Text]

    Huelsenbeck J. P., Larget B., Alfaro M. Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol. Biol. Evol. (2004) 21:1123–1133.[Abstract/Free Full Text]

    Jeffroy O., Brinkmann H., Delsuc F., Philippe H. Phylogenomics: The beginning of incongruence? Trends Genet. (2006) 22:225–231.[CrossRef][Web of Science][Medline]

    Kawahara Y., Imanishi T. A genome-wide survey of changes in protein evolutionary rates across four closely related species of Saccharomyces sensu stricto group. BMC Evol. Biol. (2007) 7.

    Kim J. Macroevolution of the hairy enhancer in Drosophila species. J. Exp. Zool. (Mol. Dev. Evol.) (2001) 291:175–185.[CrossRef]

    Kolaczkowski B., Thornton J. W. Performance of maximum-parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature (2004) 431:980–984.[CrossRef][Medline]

    Lartillot N., Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. (2004) 21:1095–1109.[Abstract/Free Full Text]

    Lewis P. O., Holder M. T., Holsinger K. E. Polytomies and Bayesian phylogenetic inference. Syst. Biol. (2005) 54:241–253.[Abstract/Free Full Text]

    Li W.-H. Molecular evolution (1997) Sunderland, Massachusetts: Sinauer.

    Lockhart P., Steel M. A tale of two processes. Syst. Biol. (2005) 54:948–951.[Free Full Text]

    Mossel E., Vigoda E. Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science (2005) 309:2207–2209.[Abstract/Free Full Text]

    Nash S. G. A survey of truncated-Newton methods. J. Comput. Appl. Math. (2000) 124:45–59.[CrossRef]

    Nash S. G., Sofer A. Linear and nonlinear programming (1996) New York: McGraw-Hill.

    Nguyen X. L., Jordan M. I. On the concentration of expectation and approximate inference in layered networks. In: Adv. Neural. Inform. Proc. Syst. (NIPS) (2003) 17. Cambridge, Massachusetts: MIT Press.

    Olsen G. J., Matsuda H., Hagstrom R., Overbeek R. fastDNAml: A tool for construction of phylogenetic trees of DNA sequences using maximum-likelihood. Comput. Appl. Biosci. (1994) 10:41–48.[Abstract/Free Full Text]

    Pagel M., Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. (2004) 53:571–581.[Abstract/Free Full Text]

    Park T. Alternative penalty functions for penalized likelihood principal components. J. Appl. Stat. (2007) 34:767–777.[CrossRef][Web of Science]

    Philippe H., Delsuc F., Brinkmann H., Lartillot N. Phylogenomics. Ann. Rev. Ecol. Evol. Syst. (2005) 36:541–562.[CrossRef]

    Platt J. C., Kiciman E., Maltz D. A. Fast variational inference for large-scale internet diagnosis. Neural Inform. Proc. Syst. (2008) 20. 2007.

    Siddall M. E. Success of parsimony in the four-taxon case: Long-branch repulsion by likelihood in the Farris Zone. Cladistics (1998) 14:209–220.[CrossRef][Web of Science]

    Simmons M. P., Zhang L. B., Webb C. T., Reeves A., Miller J. A. The relative performance of Bayesian and parsimony approaches when sampling characters evolving under homogeneous and heterogeneous sets of parameters. Cladistics (2006) 22:171–185.[CrossRef][Web of Science]

    Singh N. D., Arndt P. F., Petrov D. A. Genomic heterogeneity of background substitutional patterns in Drosophila melanogaster. Genetics (2005) 169:709–722.[Abstract/Free Full Text]

    Sober E. Reconstructing the past (1988) Cambridge, Massachusetts: MIT Press.

    Spencer M., Susko E., Roger A. J. Likelihood, parsimony, and heterogeneous evolution. Mol. Biol. Evol. (2005) 22:1161–1164.[Abstract/Free Full Text]

    Stamatakis A. RAxML-VI-HPC: Maximum-likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics (2006) 22:2688–2690.[Abstract/Free Full Text]

    Steel M. Sufficient conditions for two tree reconstruction techniques to succeed on sufficiently long sequences. SIAM J. Disc. Math. (2001) 14:36–48.[CrossRef]

    Sullivan J., Joyce P. Model selection in phylogenetics. Ann. Rev. Ecol. Evol. Syst. (2005) 36:445–466.[CrossRef]

    Susko E., Inagaki Y., Roger A. J. On inconsistency of the neighbor-joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled. Mol. Biol. Evol. (2004) 21:1629–1642.[Abstract/Free Full Text]

    Swofford D. L., Olsen G. J., Waddell P. J., Hillis D. M. Phylogenetic inference. In: Molecular systematics—Hillis D. M., Moritz C., Mable B. K., eds. (1996) Sunderland, Massachusetts: Sinauer. 407–514.

    Tehler A., Little D. P., Farris J. S. The full-length phylogenetic tree from 1551 ribosomal sequences of chitinous fungi. Fungi Mycol. Res. (2003) 107:901–916.

    Thorne J. L. Protein evolution constraints and model-based techniques to study them. Curr. Opin. Struct. Biol. (2007) 17:337–341.[CrossRef][Web of Science][Medline]

    Tuffley C., Steel M. Links between maximum-likelihood and maximum-parsimony under a simple model of site substitution. Bull. Math. Biol. (1997) 59:581–607.[Web of Science][Medline]

    Uzzell T., Corbin K. W. Fitting discrete probability distributions to evolutionary events. Science (1971) 172:1089–1096.[Abstract/Free Full Text]

    Williamson S. H., Hubisz M. J., Clark A. G., Payseur B. A., Bustamante C. D., Nielsen R. Localizing recent adaptive evolution in the human genome. Plos Genet. (2007) 3:901–915.[Web of Science]

    Yang Z., Rannala B. Branch-length prior influences Bayesian posterior probability of phylogeny. Syst. Biol. (2005) 54:455–470.[Abstract/Free Full Text]

    Zwickl D. J. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum-likelihood criterion. In: PhD Dissertation (2006) Austin, Texas: University of Texas at Austin.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
J. Wu and E. Susko
General Heterotachy and Distance Method Adjustments
Mol. Biol. Evol., December 1, 2009; 26(12): 2689 - 2697.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kim, J.
Right arrow Articles by Sanderson, M. J.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Kim, J.
Right arrow Articles by Sanderson, M. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?