© 2007 Society of Systematic Biologists
Estimation of Phylogeny and Invariant Sites under the General Markov Model of Nucleotide Sequence Evolution
Edited by Mike Steel: Associate Editor
1 Sydney Bioinformatics, University of Sydney NSW 2006, Australia
2 Centre for Mathematical Biology, University of Sydney NSW 2006, Australia
3 School of Mathematics and Statistics, University of Sydney NSW 2006, Australia
4 School of Biological Sciences, University of Sydney NSW 2006, Australia E-mail: lars.jermiin{at}usyd.edu.au
| Abstract |
|---|
|
|
|---|
The models of nucleotide substitution used by most maximum likelihood–based methods assume that the evolutionary process is stationary, reversible, and homogeneous. We present an extension of the Barry and Hartigan model, which can be used to estimate parameters by maximum likelihood (ML) when the data contain invariant sites and there are violations of the assumptions of stationarity, reversibility, and homogeneity. Unlike most ML methods for estimating invariant sites, we estimate the nucleotide composition of invariant sites separately from that of variable sites. We analyze a bacterial data set where problems due to lack of stationarity and homogeneity have been previously well noted and use the parametric bootstrap to show that the data are consistent with our general Markov model. We also show that estimates of invariant sites obtained using our method are fairly accurate when applied to data simulated under the general Markov model.
Keywords: Invariant sites; maximum likelihood; nonhomogeneous process; nonstationary process; nucleotide substitution; phylogenetics
Received May 2, 2006; Revised July 9, 2006; Accepted September 12, 2006
Many maximum-likelihood-based methods for phylogenetic inference rely on explicit models of nucleotide substitution, the common underlying assumption being that the process is Markovian (i.e., the conditional probability of change at a given site depends only on the current state and is independent of the previous states). In addition, if we assume that the Markovian process is stationary, homogeneous, and reversible (for details, see Jayaswal et al., 2005; Ababneh et al., 2006b) and that each site furthermore is independent and identically distributed (iid), then we obtain the family of time-reversible substitution models, which ranges from the simple model of Jukes and Cantor (1969) to the general time-reversible (GTR) model of Lanave et al. (1984).
One of the factors known to bias the phylogenetic inference is compositional heterogeneity among sequences (e.g., Lockhart et al., 1994; Foster and Hickey, 1999; Chang and Campbell, 2000; Conant and Lewis, 2001; Jermiin et al., 2004), which implies a violation of the assumption of stationarity, and so also a violation of the assumptions of homogeneity and reversibility. Even in the absence of compositional heterogeneity, rate heterogeneity among lineages can confound phylogenetic estimates (e.g., Felsenstein, 1978; Hasegawa et al., 1991; Steel et al., 1993; Ho and Jermiin, 2004), so there is a need for more general nucleotide substitution models that relax the assumptions of stationarity, reversibility, and homogeneity. Based on the maximum-likelihood criterion, such models have been proposed by Barry and Hartigan (1987), Yang and Roberts (1995), Galtier and Gouy (1998), Galtier et al. (1999), and Foster (2004).
Yang and Roberts (1995) considered separate HKY models (Hasegawa et al., 1985) along different edges of the tree by varying the frequency parameters but encountered a convergence problem even for as few as four taxa. Based on Tamura's (1992) model, which is simpler than the HKY model, Galtier and Gouy (1998) developed a model that can account for nonstationary and nonhomogeneous processes and used it to infer the phylogeny of 16 bacterial ribosomal RNA sequences. Their model was generalized by Galtier et al. (1999) to account for rate heterogeneity across sites. Though none of these models require global homogeneity, they still assume local homogeneity; i.e., the evolutionary process over a particular edge of the tree cannot change with time—this is also true for Foster's (2004) model. All of these models can therefore be considered special cases of Barry and Hartigan's (1987) model, which does not require stationarity, reversibility, and homogeneity (global or local).
Apart from relaxing the assumptions of stationarity, reversibility, and homogeneity, the Markovian models of molecular evolution can be made more realistic by allowing for different rates of evolution at different sites along a sequence. Ignoring this information can result in an underestimation of edge lengths and biases in the estimation of other parameters (e.g., Wakeley, 1994; Yang et al., 1994). Two of the most commonly used approaches for considering rate heterogeneity across sites are (i) grouping of sites into variable and fixed sites, with the processes at variable sites being iid (e.g., Lockhart et al., 1996; Steel et al., 2000; Swofford, 2002; Guindon and Gascuel, 2003); and (ii) using a continuous or discretized
-distribution to integrate over rates at individual sites (e.g., Yang, 1993, 1994). Although the family of time-reversible substitution models has been extended to incorporate both these approaches, and Yang and Roberts' (1995) models have been extended to incorporate the latter approach, the Barry and Hartigan (1987) model, henceforth called the BH model, currently assumes that all sites are iid.
Jayaswal et al. (2005) implemented the BH model and highlighted the advantages of using this model as opposed to the family of time-reversible substitution models, such as (i) obtaining a good fit even when the process is not stationary, reversible, and homogeneous; (ii) checking the assumption of reversibility; and (iii) finding instances of convergent evolution more frequently compared to the family of time-reversible substitution models. Although the BH model is promising for the above three reasons, it is of limited use as it assumes that the processes at the sites are iid. This assumption can be relaxed by assuming that the nucleotides at some sites remain constant; i.e., invariant over time. In this paper, we use the term "unchanged" to denote those sites that have the same nucleotide in all sequences, and the term "invariant" to denote unchanged sites that are not allowed to change.
The proportion of invariant sites can be estimated using a capture-recapture method (Sidow et al., 1992; Steel et al., 2000) or a maximum-likelihood-based approach. Phylogenetic software packages, such as Splits-Tree (Huson and Bryant, 2006), use the former method, whereas other packages, such as PAUP* (Swofford, 2002) and PHYML (Guindon and Gascuel, 2003), use the latter approach to estimate the proportion of sites that are invariant. The capture-recapture methods estimate the proportion of sites that are invariant but not their nucleotide composition. Therefore, further analysis of the data, namely the estimation of tree topology and edge lengths, which requires removal of the invariant sites, is based on the assumption that the proportion of invariant As, Cs, Gs, and Ts is (i) equal to 0.25; (ii) equal to the proportion of As, Cs, Gs, and Ts in the data set; or (iii) equal to the proportion of As, Cs, Gs, and Ts in the sites that appear unchanged. As the correct estimate of the nucleotide composition of invariant sites may differ from these proportions, it is desirable to estimate the parameters for the variable and invariant sites separately, as pointed out by Waddell and Steel (1997). Unlike the capture-recapture methods, the maximum-likelihood methods can be used directly to estimate the nucleotide composition of invariant sites.
In this paper, we describe an approach for estimating invariant sites without the assumptions of stationarity, reversibility, and homogeneity. In doing so, we assume that (i) the sites in an alignment can be divided into two groups—variable and invariant (all sites excluding the invariant sites are said to be variable); (ii) all variable sites evolve under independent and identically distributed Markov processes. Next, we analyze an alignment of bacterial ribosomal RNA sequences using a version of the BH model that accounts for the presence of invariant sites (i.e., the BH + I model). Finally, parametric bootstrapping is used to show (i) that the bacterial data set is consistent with BH + I model; and (ii) that the maximum-likelihood estimates of invariant sites are close to the actual values in data generated under the model.
| Method |
|---|
|
|
|---|
We introduce notation similar to that used by Jayaswal et al. (2005). Let T denote an unrooted binary tree with l leaves, l – 2 internal nodes, and 2l – 3 edges, for l
2. Denote leaves by L = {–1, ..., –l} and internal nodes by I = {1, ..., l–2}. Denote the set of all nodes V = L
I and the set of all edges by E = {(a, b) : a, b
V and adjacent }. If an edge (a, b) of the unrooted tree is deleted, then two rooted subtrees T(a,b) and T(b,a) are formed with roots at a and b, respectively.
Let B = {A,C,G,T} and Xa and Xb be the nucleotides at nodes a and b of the edge (a, b). The joint probability along the edge (a, b) is
|
| (1) |
|
| (2) |
Let L(a,b) = L
T(a,b) and I(a,b) = I
T(a,b) denote the sets of leaf nodes and internal nodes, respectively, in T(a,b), then xL(a,b) and xT(a,b) denote the vectors of the values of nucleotides in L(a,b) and T(a,b), respectively, and
|
| (3) |
Take the model to be Markovian, so that, given (Xa,Xb) = (xa,xb), the conditional distribution of the nucleotides on the leaves of the rooted subtrees T(a,b) and T(b,a), obtained by deleting edge (a,b), are independent. At each site i = 1, ..., N the value of a nucleotide at the rth leaf, xri is known, but at internal nodes the nucleotide can take any value in B. Let Bri = {xri} if r
L, and B if r
I. Then the joint probability distribution of leaf nodes at site i can be expressed as
|
| (4) |
Bai, xb
Bbi) is an indicator function that takes the value 1 if both conditions are satisfied, 0 otherwise, and
|
| (5) |
Note that (4) and (5) hold for all edges (a,b)
E. Therefore, it is necessary that
|
| (6) |
Differentiating Equation (4) with respect to Q(a,b) (xa,xb), we obtain
|
|
So far we have concerned ourselves with the process at a single site. We will now focus on the processes at different sites and use the following notation to describe site-specific information:
- Let Y = m x N denote the matrix of aligned nucleotide sequences with m being the number of sequences and N being the number of sites in the alignment. Then, yi denotes the ith column of the data matrix.
- Let N2 denote the number of sites that appear unchanged, i.e., sites having the same nucleotide in all sequences, and let N1 = N – N2 denote the remaining number of sites. Because the sites are assumed to be independent, we can rearrange them without loss of generality such that the last N2 sites correspond to the unchanged sites.
- Let β denote the probability of a site being invariant and
denote the probability that a site is variable.
- Let H denote the set of column vectors of length m, all elements of a vector being identical and equal to A, C, G, or T.
- Let P(yi| inv) denote the probability that the ith site is of type h
H, given that it is an invariant site. Then, P(yi| inv) = P(h| inv) for all sites having the pattern yi = h.
Now, the overall log-likelihood can be written as
|
| (7) |
|
| (8) |
Because in the current model it is assumed that an invariant site does not change over the entire phylogenetic tree, P(yi| inv) = 0 when yi
H. Therefore, (8) can be rewritten as
|
|
|
| (9) |
Because the maximum-likelihood estimates of the parameters have to be determined subject to the constraints that
xa
B
xb
BQ(a,b)(xa,xb) = 1,
+ β = 1 and
h
HP(h| inv) = 1, we introduce Lagrange multipliers for each of the three constraints and maximize
|
| (10) |
|
| (11) |
|
| (12) |
As in Section 3.2 of Jayaswal et al. (2005), it can be shown that (12) satisfies the constraint of internal consistency, i.e., (6) holds for all internal nodes.
|
| (13) |
|
| (14) |
|
| (15) |
Using the appropriate constraint from Equation (10), the values of
1,
2, and
3 can be determined. For example,
|
|
|
| (04) |
|
| (16) |
and β can be obtained. Note that the estimates obtained at the end of each iteration using Equations (12) to (15) are always non-negative because all the terms on the right-hand side of these equations, including the Lagrange multipliers, are non-negative.
The initial values for the Q-matrices are chosen such that the diagonal elements are greater than the off-diagonal elements because, as noted by Jayaswal et al. (2005), such Q-matrices tend to converge to a common local maximum. If extreme values of Q-matrices are chosen with the off-diagonal elements being greater than the diagonal ones, then we can obtain convergence to local maxima that differ from the most common one. During simulations we observed that the local maxima obtained using extreme values of Q-matrices were lower than the common maximum. Also, the initial Q-matrices should satisfy the constraint of internal consistency; i.e., (6) must be satisfied for all internal nodes.
The initial values of the remaining parameters are chosen such that 0 < β, P(A| inv), P(C| inv), P(G| inv), P(T| inv) < 1 and
X
BP(X| inv) = 1. Next, Equations (12) to (15) are used to update the parameter values until convergence is achieved.
In the BH +I model described above, all the parameters are updated simultaneously at the end of each iteration. However, the BH+I model can also be implemented as follows: (i) remove a proportion of the unchanged sites (which are estimated to be invariant) from the data set and use the BH model over the remaining sites to estimate the Q-matrices; (ii) use the revised Q-matrices to estimate β, P(A| inv), P(C| inv), P(G | inv), and P(T| inv). The two steps are repeated until convergence is achieved. The expectation-maximization (EM) (Dempster at al., 1977; Kung et al., 2004) approach to parameter optimizaton of the BH+I model essentially reduces to the two steps described in this paragraph. It gives the same result as that obtained using the direct optimization of parameters but the current implementation of the EM algorithm is computationally slower. A program that implements our model is available from http://www.bio.usyd.edu.au/jermiin/programs.htm.
Reeves (1992) used an EM algorithm for estimating the parameters for amino acid sequences. However, his programs are not available online and do not appear to have been incorporated into any of the existing phylogenetic software packages.
| Results |
|---|
|
|
|---|
Bacterial Data
We analyzed a bacterial data set, previously examined by Jayaswal et al. (2005) and Ababneh et al. (2006a), using the BH+I model. The data set comprises an alignment of 1238 nucleotides from the 16S ribosomal RNA genes (with abbreviated names given in parentheses) of: Aquifex pyrophilus (Apyr), Thermotoga maritima (Tmar), Thermus thermophilus (Tthe), Deinococcus radiodurans (Drad), and Bacillus subtilis (Bsub). Using methods described in Ababneh et al. (2006a), these data were found to violate the assumptions of stationarity, reversibility, and homogeneity (Jayaswal at al., 2005).
Table 1 shows some of the phylogenetic results obtained using the BH and BH+I models. The number of parameters for the BH model are 24 l – 33, where l denotes the number of leaf nodes (Jayaswal at al., 2005), whereas the BH+I model has an additional four parameters. The large increase in log-likelihood values (
93) for an additional four parameters provides evidence that the bacterial data set has a proportion of invariant sites that were not properly accounted for in previous phylogenetic analyses (Galtier and Gouy, 1995; Jayaswal et al., 2005; Ababneh et al., 2006a).
|
The ranks of the tree topologies under the two models differ slightly—e.g., T5 and T6 have higher log-likelihood values compared to T7 under the BH model but not under the BH+I model. Also, the estimate of β varies only slightly with the tree topology. This is in agreement with the observation that although maximum-likelihood estimates of invariant sites are tree dependent, large deviations are unlikely (Sullivan at al., 1996; Lockhart et al., 1998). The most likely tree obtained using the BH+I model is the same as that inferred by Galtier and Gouy (1995), Gupta (2000), Foster (2004), Jayaswal et al. (2005), and Ababneh et al. (2006a).
Table 2 shows the log-likelihoods obtained under the GTR, GTR+I, and GTR+
models for the 15 unrooted tree topologies. The ranks of the trees vary slightly from those obtained under the BH or BH+I model. For example, unlike the BH and BH+I models, T2 is the most likely tree under the GTR models.
|
To test whether a general Markov model with invariant sites is sufficient to explain the evolutionary process, we used a modification of Cox's (1962) test as described by Goldman (1993) for nested models. Using the maximum-likelihood estimates of the parameters (i.e., Q-matrices, β, P(A| inv), P(C| inv), P(G| inv), P(T| inv)) obtained for T1, we generated 1,000 pseudo-data sets. For each pseudo-data set, we calculated
= log LU– log L1, where log LU denotes the unconstrained log-likelihood and log L1 denotes the log-likelihood for T1. The unconstrained log-likelihood log LU equals
patternNpatternlog (Npattern/N), where Npattern denotes the number of occurences of a particular pattern in the data set and N denotes the total number of sites.
For the bacterial data set,
obs (i.e., the difference between unconstrained log-likelihood and log-likelihood of T1), is 229.93 and the proportion of pseudo-data sets with
obs was found to be 0.111 (Fig. 1). Because
obs lies within the expected distribution of
(i.e., the one generated using parametric bootstrap), we have evidence that the bacterial data could have arisen under a general Markov model with invariant sites.
|
To test the null hypothesis that the GTR+I model is sufficient to explain the bacterial data set against the alternate hypothesis that the BH+I model explains the data set, we generated 1,000 pseudo-data sets under the GTR+I model using Seq-Gen (Rambaut and Grassly 1997) for the parameters corresponding to T2. The pseudo-data sets were analyzed under both the models for T2, and the difference in log-likelihoods varied between 23.85 and 61.01; none of the values were larger than the observed difference of 113.91 (i.e., the difference between –4203.027 and –4316.18; values obtained from Tables 1 and 2). Thus, we have strong evidence against the null hypothesis, implying that the BH+I model provides a better fit for the bacterial data set and the large difference in log-likelihood values between the two models is not due to overparameterization.
To test the null hypothesis that a particular tree, T, is the true tree versus the alternate hypothesis that a tree other than T is the true tree, we applied the SOWH test (Swofford et al., 1996; Goldman et al., 2000; Shi et al., 2005) to each of the 15 unrooted trees. It is a parametric bootstrap test used for hypothesis testing when T is not selected a priori as the true tree, and the model of nucleotide substitution is correct. The test can be performed as follows:
- generate 1,000 pseudo-data sets under the null hypothesis;
- calculate
= log LML– log LT for each pseudo-data set, where log LML is the log-likelihood of the most likely tree and log LT is the log-likelihood of T;
- calculate p, the proportion of times that
obs, where
obs is the difference in the log-likelihood values of the most likely tree and T for the original data set. A large p-value would support the null hypothesis; otherwise, there is evidence against it.
The above-mentioned three steps were repeated for each of the 15 unrooted trees. Because T1 is the most likely tree,
obs = 0 and the p-value for the null hypothesis—that T1 is the true tree—is 1. The extremely low p-values (Table 3) for the remaining trees provide strong evidence against any tree other than T1 being the true tree.
|
In addition to the SOWH test, for each tree Ti, i = 1,...,15, we also determined the percentage of the pseudo-data sets where the most likely tree and the true tree were the same. Some of the trees were found to have low percentages (Table 3), suggesting that if one of these trees were correct, then the internal edge lengths would be very short (or almost 0). The Q-matrix along an edge of length 0 would be diagonal, and an analysis of the Q-matrices (estimated on the bacterial data set) for the trees with low percentages showed that this was indeed the case.
Figure 2 shows the labeled nodes for T1 and Table 4 shows the marginal probabilities at these nodes. The marginal probabilities were obtained after removing the invariant sites and can be divided into two groups. The first group comprises the internal nodes (1, 2, and 3) and the leaf nodes Aquifex, Thermotoga, and Thermus. The other group consists of the leaf nodes Bacillus and Deinococcus. The marginal probabilities of nodes within each group are similar but different from those in the other group. This is even more apparent when the GC content is considered (Fig. 2). The distribution of the marginal probabilities of the nodes belonging to the first group suggests that the subtree excluding the leaf nodes Bacillus and Deinococcus has a stationary Markov process acting on it. Furthermore, the Q-matrices along the edges of the subtree were found to be symmetric, which suggests that the subtree could be represented by a time-reversible model (R1). The Q-matrices along the edges leading to the leaf nodes Bacillus and Deinococcus were found to be similar with respect to one another. Table 5 shows the inferred divergence matrices along the two edges (obtained by multiplying the Q-matrices with the number of variable sites). The two matrices have a noticeable substitution bias of G
A, C
T, and G
T over A
G, T
C, and T
G, unlike the remaining edges, which had symmetric Q-matrices. This, coupled with the observation that the internal nodes 2 and 3 have similar marginal probabilities, suggests that another Markov model, R2 (where R1
R2), could be used to explain the two independent processes along the edges leading to Bacillus and Deinococcus. Such a scenario, based on two different rate matrices, was proposed by Foster (2004) and found to have a good fit for the bacterial data. Thus, in addition to providing a good fit, the BH+I model can help in identifying simpler models that are consistent with the data.
|
|
|
Comparison of Models for Estimating Invariant Sites
The estimate of β obtained using the BH+I model was compared with those obtained using other programs that can estimate β. We used the 1,000 pseudo-data sets generated on T1 and the maximum-likelihood estimates of the parameters (i.e., Q-matrices, β, P(A| inv), P(C| inv), P(G| inv), P(T| inv)) obtained from the bacterial data set and compared the β values obtained using the BH+I model with those obtained using the methods implemented in PHYML and SplitsTree. For PHYML, the β values were estimated using the GTR+I model and found to be close to those obtained using the BH+I model. In contrast, the capture-recapture method of SplitsTree returned significantly lower estimates of β (Table 6). The bias may be due to the fact that the sequences were generated under nonstationary and nonhomogeneous conditions and the formula used by this capture-recapture method for estimating invariant sites is not exact when the GTR or more complex substitution models are considered (Steel et al., 2000). For the BH+I model, the actual and estimated β values are close (less than one standard deviation apart: see Table 6), but the estimate of β is slightly below the expected value.
|
The invariant sites estimates obtained on T1 using BH+I and GTR+I models were close, but the difference in log-likelihood values was 160.182 ± 16.286 (i.e., the mean and sample standard deviation). Although the GTR+I model has 16 degrees of freedom for the five-taxon tree, the BH+I model has 91 degrees of freedom, resulting in 75 additional degrees of freedom for the BH+I model. The large difference in the log-likelihood values cannot be explained only by the increase in the number of degrees of freedom and thus provides additional evidence that the bacterial data have evolved under conditions that are more complex than those accounted for by the GTR+I model. Our result, therefore, corroborates results by Galtier and Gouy (1995), Foster (2004), Jayaswal et al. (2005), and Ababneh et al. (2006a).
The actual and estimated values of the proportion of invariant As, Cs, Gs, and Ts for the pseudo-data sets are shown in Table 7. The estimated values were obtained for T1 and are close to the actual values.
|
| Discussion |
|---|
|
|
|---|
The BH model is the most general Markov model of nucleotide evolution under the assumption that the processes at sites are independent and identically distributed (iid). As this assumption is often violated by real data, there is a need to consider rate heterogeneity across sites while using the general Markov model. The simplest approach would be to assume that (i) the sites can be divided into two groups depending on whether they are variable or invariant and (ii) variable sites are iid. This approach has been described in the current paper.
We used the BH+I model to analyze a set of bacterial ribosomal RNA genes, which have evolved under nonstationary, nonreversible, and nonhomogeneous conditions. Using parametric bootstrap analysis, we found that the optimal tree and the associated parameter values fit the bacterial data well. As the BH+I model provides a good fit, we used the parametric bootstrap to compare tree topologies and showed that there is a strong evidence for T1 being the true tree.
There are several benefits of using the BH+I model, such as (i) obtaining a better fit in terms of log-likelihood value; (ii) checking for substitutional biases along individual edges of a phylogenetic tree; and (iii) estimating the nucleotide composition of invariant sites even when the process is nonstationary, nonreversible, and nonhomogeneous. The number of parameters for the BH+I model is much greater than that for the GTR+I model, and it may be argued that there is a risk of overfitting the data using models that are not based on biological knowledge. Using the parametric bootstrap, we showed that the GTR+I model is not sufficient to fully explain the complexity of the bacterial data set. The pivotal question, therefore, is how many parameters are needed to model the evolution of a set of sequences.
It should also be borne in mind that a good fit between the model and the data only indicates that the model is consistent with the data. In the present case, a good fit was obtained between the data and a model that does not consider the ribosomal RNA molecule's secondary structure (which implies that many of the sites most likely are not independent). The BH+I model can also be used as a guide for identifying simpler models that appropriately explain the variation in sequence data. For example, our analysis of the Q-matrices showed that the bacterial data set could be described using two different Markovian processes and such a Markov model (with fewer parameters than the BH+I model) was proposed by Foster (2004). However, another data set might require a more parameter-rich model such as the BH+I model.
Another direction of future research would be to extend the BH+I model to include information on rate heterogeneity across variable sites. Although the EM algorithm is computationally slower than that described above, it could be extended to incorporate rate heterogeneity across sites; however, it would have an extremely large number of parameters and might be impractical from a computational point of view.
A final area of future research would be to partition the computational problem associated with these parameter-rich models such that programs implementing them can be executed on supercomputers or distributed systems of processors. Recent papers have shown that it is possible to obtain a near-linear speedup of phylogenetic analyses on such computational resources (Keane et al., 2005; Zhou et al., 2007).
| Acknowledgment |
|---|
|
|
|---|
We thank Mike Steel, Stephane Guindon, Derrik Zwickl, and an anonymous reviewer for their constructive comments and suggestions. This research was partly funded by Discovery Grants (DP0453173 and DP0556820) from the Australian Research Council.
| References |
|---|
|
|
|---|
-
Ababneh F., Jermiin L. S., Ma C., Robinson J. Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics (2006a) 22:1225–1231.
Ababneh F., Jermiin L. S., Robinson J. Generation of the exact distribution and simulation of matched nucleotide sequences on a phylogenetic tree. J. Math. Model. Algor. (2006b) 5:291–308.[CrossRef]
Barry D., Hartigan J. A. Statistical analysis of hominoid molecular evolution. Stat. Sci. (1987) 2:191–210.[CrossRef]
Chang B. S. W., Campbell D. L. Bias in phylogenetic reconstruction of vertebrate Rhodopsin sequences. Mol. Biol. Evol. (2000) 17:1220–1231.
Conant G. C., Lewis P. O. Effects of nucleotide composition bias on the success of the parsimony criterion in phylogenetic inference. Mol. Biol. Evol. (2001) 18:1024–1033.
Cox D. R. Further results on tests of separate families of hypotheses. J. R. Stat. Soc. B (1962) 24:406–424.
Dempster A. P., Laird N. M., Rubin D. B. Maximum likelihood from incomplete data via an EM algorithm. J. R. Stat. Soc. B. (1977) 39:1–38.
Felsenstein J. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. (1978) 27:401–410.
Foster P. G. Modelling compositional heterogeneity. Syst. Biol. (2004) 53:485–495.
Foster P. G., Hickey D. A. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. (1999) 48:284–290.[CrossRef][Web of Science][Medline]
Galtier N., Gouy M. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA (1995) 92:11317–11321.
Galtier N., Gouy M. Inferring pattern and process: Maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. (1998) 15:871–879.[Abstract]
Galtier N., Tourasse N., Gouy M. A nonhyperthermophilic common ancestor to extant life forms. Science (1999) 283:220–221.
Goldman N. Statistical tests of models of DNA substitution. J. Mol. Evol. (1993) 36:182–198.[CrossRef][Web of Science][Medline]
Goldman N., Anderson J. P., Rodrigo A. G. Likelihood-based tests of topologies in phylogenetics. Syst. Biol. (2000) 49:652–670.
Guindon S., Gascuel O. A simple, fast and accurate method to estimate large phylogenies by maximum-likelihood. Syst. Biol. (2003) 52:696–704.
Gupta R. S. The phylogeny of proteobacteria: Relationships to other eubacterial phyla and eukaryotes. FEMS Microbiol. Rev. (2000) 24:367–402.[CrossRef][Web of Science][Medline]
Hasegawa M., Kishino H., Saitou N. On the maximum likelihood method in molecular phylogenetics. J. Mol. Evol. (1991) 32:443–445.[CrossRef][Web of Science][Medline]
Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.[CrossRef][Web of Science][Medline]
Ho S. Y. W., Jermiin L. S. Tracing the decay of the historical signal in biological sequence data. Syst. Biol. (2004) 53:623–637.
Huson D. H., Bryant D. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. (2006) 23:254–267.
Jayaswal V., Jermiin L. S., Robinson J. Estimation of phylogeny using a general Markov model. Evol. Bioinf. (2005) 1:62–80.
Jermiin L. S., Ho S. W. H., Ababneh F., Robinson J., Larkum A. W. D. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. (2004) 53:638–643.
Jukes T. H., Cantor C. R. Evolution of protein molecules. In: Mammalian protein metabolism—Munro H. N., ed. (1969) New York: Academic Press. 21–132.
Keane T. M., Naughton T. J., Travers S. A. A., McInerney J. O., McCormack G. P. DPRml: Distributed phylogeny reconstruction by maximum likelihood. Bioinfomatics (2005) 21:969–974.
Kung S. Y., Mak M. W., Lin S. H. Biometric authentication: A machine learning approach (2004) New Jersey: Prentice Hall PTR.
Lanave C., Preparata G., Saccone C., Serio G. A new method for calculating evolutionary substitution rates. J. Mol. Evol. (1984) 20:86–93.[CrossRef][Web of Science][Medline]
Lockhart P. J., Larkum A. W. D., Steel M. A., Waddell P. J., Penny D. Evolution of chlorophyll and bacteriochlorophyll: The problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA. (1996) 93:1930–1934.
Lockhart P. J., Steel M. A., Barbrook A. C., Huson D. H., Charleston M. A., Howe C. J. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. (1998) 15:1183–1188.[Abstract]
Lockhart P. J., Steel M. A., Hendy M. D., Penny D. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. (1994) 11:605–612.[Web of Science][Medline]
Rambaut A., Grassly N. C. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comp. Appl. Biosci. (1997) 13:235–238.[Medline]
Reeves J. Heterogeneity in the substitution process of amino acid sites of proteins coded for by the mitochondrial DNA. J. Mol. Evol. (1992) 35:17–31.[CrossRef][Web of Science][Medline]
Shi X., Gu H., Susko E., Field C. The comparison of the confidence regions in phylogeny. Mol. Biol. Evol. (2005) 22:2285–2296.
Sidow A., Nguyen T., Speed T. P. Estimating the fraction of invariable codons with a capture-recapture method. J. Mol. Evol. (1992) 35:253–260.[Web of Science][Medline]
Steel M., Hendy M. D., Penny D. Parsimony can be consistent! Syst. Biol. (1993) 42:581–587.
Steel M., Huson D., Lockhart P. J. Invariable sites models and their use in phylogeny reconstruction. Syst. Biol. (2000) 49:225–232.
Sullivan J., Holsinger K. E., Simon C. The effect of topology on estimates of among-site rate variation. J. Mol. Evol. (1996) 42:308–312.[CrossRef][Web of Science][Medline]
Swofford D. L. PAUP*: Phylogenetic analysis using parsimony (*and other methods). Version 4 (2002) Sunderland, Massachusetts: Sinauer Associates.
Swofford D. L., Olsen G. J., Waddell P. J., Hillis D. M. Phylogenetic inference. In: Molecular systematics—Hillis D. M., Moritz C., Mable B. K., eds. (1996) Sunderland, Massachusetts: Sinauer Associates. 407–514.
Tamura K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. (1992) 9:678–687.[Abstract]
Waddell P., Steel M. General time-reversible distances with unequal rates across sites: Mixing gamma and inverse Gaussian distributions with invariant sites. Mol. Phylogenet. Evol. (1997) 8:398–414.[CrossRef][Web of Science][Medline]
Wakeley J. Substitution rate variation among sites and the estimation of transition bias. Mol. Biol. Evol. (1994) 11:436–442.[Abstract]
Yang Z. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.[Abstract]
Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. (1994) 39:306–314.[CrossRef][Web of Science][Medline]
Yang Z., Goldman N., Friday A. E. Comparison of models for nucleotide substitution used in maximum likelihood phylogenetic estimation. Mol. Biol. Evol. (1994) 11:316–314.[Abstract]
Yang Z., Roberts D. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. (1995) 12:451–458.[Abstract]
Zhou B. B., Till M., Zomaya A. Y., Jermiin L. S. A parallel implementation of the maximum-likelihood method for phylogenetic inference. Int. J. High Perform. Comp. Network. (2007) In press.
This article has been cited by other articles:
![]() |
R. G. Beiko, W. F. Doolittle, and R. L. Charlebois The Impact of Reticulate Evolution on Genome Phylogeny Syst Biol, December 1, 2008; 57(6): 844 - 856. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












