© 2008 Society of Systematic Biologists
Synonymous Substitutions Substantially Improve Evolutionary Inference from Highly Diverged Proteins
Edited by Paul Lewis
1 Professional Programme for Agricultural Bioinformatics, Graduate School of Agricultural and Life Sciences, University of Tokyo Tokyo, Japan; E-mail: seo{at}iu.a.u-tokyo.ac.jp (T.-K.S.)
2 Laboratory of Biometrics and Bioinformatics, Graduate School of Agricultural and Life Sciences, University of Tokyo Tokyo, Japan
| Abstract |
|---|
|
|
|---|
Codon-and amino acid-substitution models are widely used for the evolutionary analysis of protein-coding DNA sequences. Using codon models, the amounts of both nonsynonymous and synonymous DNA substitutions can be estimated. The ratio of these amounts represents the strength of selective pressure. Using amino acid models, the amount of nonsynonymous substitutions is estimated, but that of synonymous substitutions is ignored. Although amino acid models lose any information regarding synonymous substitutions, they explicitly incorporate the information for amino acid replacement, which is empirically derived from databases. It is often presumed that when the protein-coding sequences are highly divergent, synonymous substitutions might be saturated and the evolutionary analysis may be hampered by synonymous noise. However, there exists no quantitative procedure to verify whether synonymous substitutions can be ignored; therefore, amino acid models have been arbitrarily selected. In this study, we investigate the issue of a statistical comparison between codon-and amino acid-substitution models. For this purpose, we propose a new procedure to transform a 20-dimensional amino acid model to a 61-dimensional codon model. This transformation reveals that amino acid models belong to a subset of the codon models and enables us to test whether synonymous substitutions can be ignored by using the likelihood ratio. Our theoretical results and analyses of real data indicate that synonymous substitutions are very informative and substantially improve evolutionary inference, even when the sequences are highly divergent. Therefore, we note that amino acid models should be adopted only after carefully investigating and discarding the possibility that synonymous substitutions can reveal important evolutionary information.
Keywords: Amino acid model; codon model; likelihood-ratio test; model comparison; synonymous saturation
Received April 28, 2007; Revised July 23, 2007; Accepted December 17, 2007
The DNA substitutions of protein-coding sequences are categorized into synonymous and nonsynonymous codon substitutions. Because nonsynonymous substitutions induce amino acid replacement, they may have been subjected to Darwinian selection during the evolutionary process. The relative occurrence of nonsynonymous substitutions with respect to synonymous substitutions is used to represent the strength of selective pressure (Yang and Bielawski, 2000). Whereas codon models (Goldman and Yang, 1994; Muse and Gaut, 1994) consider both synonymous and nonsynonymous DNA substitutions, amino acid models (Dayhoff et al., 1978; Jones et al., 1992; Adachi and Hasegawa, 1996; Whelan and Goldman, 2001) take into account only nonsynonymous amino acid replacement. Although amino acid models lose the information regarding synonymous substitutions, they incorporate the information for amino acid replacement, which is empirically derived from databases. It has been widely accepted that synonymous substitutions may saturate quickly as sequences diverge (e.g., Maynard Smith and Smith, 1996). In this case, the saturation of synonymous substitutions could mask the information provided by nonsynonymous substitutions; thus, amino acid models may be better than codon models. Unfortunately, there has been no quantitative procedure to verify whether synonymous substitutions can be ignored, and amino acid models have been arbitrarily selected.
It is crucial to select an appropriate substitution model for phylogenetic analysis. Adopting an inappropriate model may lead to inconsistent branch length estimates, which can result in an incorrect inference of phylogeny (Anderson and Swofford, 2004). Under the Bayesian framework, incorrect branch length estimates may result in wrong estimations of divergence time (Thorne et al., 1998) or mistaken trend estimations of selective pressure (Seo et al., 2004). Furthermore, it has been noted that the confidence level of a reconstructed tree may be affected by inappropriate model selection (Buckley, 2002). The likelihood-ratio test (Stuart and Ord, 1996) and various model selection criteria such as AIC (Akaike, 1974) and BIC (Schwarz, 1974) can be applied to select appropriate substitution models.
Although the problem of model selection has been individually addressed to a great extent for both codon (Yang and Nielsen, 1998, 2002; Nielsen and Yang, 1998; Yang et al., 2000) and amino acid (Adachi and Hasegawa, 1996; Whelan and Goldman, 2001; Abascal et al., 2005) models, a comparison between these two models has rarely been conducted. In our study, we consider a situation in which protein-coding DNA sequences are provided such that both codon and amino acid models are applicable. The log-likelihood scores for codon and amino acid models cannot be directly compared via conventional model selection criteria (Akaike, 1974; Schwarz, 1974; Stuart and Ord, 1996). This is because the dimensions of their data spaces differ—the dimension of an amino acid space is 20, whereas the standard codon table contains three stop codons and its dimension is 61. Although different codon tables may contain differing numbers of stop codons, we assume that the dimension of the codon space is 61 for the sake of convenience of description. When an amino acid model is adopted, a 61-dimensional data space is transformed to a 20-dimensional one by translating the codons into amino acids. The log-likelihood score corresponding to this action of "dimension reduction" (translation) should be added to that of the amino acid model so that the two models are comparable. In the subsequent section, we discuss the procedure to transform an amino acid model in order to make it comparable with the codon one. We propose two types of codon models that can be derived from an amino acid model. By comparing these two types of codon models, we can verify whether the information provided by synonymous substitutions can be ignored for phylogenetic analysis. We apply our procedure to the analysis of mammalian and yeast sequence data and demonstrate that synonymous substitutions are very informative even when the sequences are highly divergent.
| Methods |
|---|
|
|
|---|
Conventional Codon Models
For the conventional codon model (Goldman and Yang, 1994), the instantaneous transition rate from codon i to codon j is represented as follows:
|
| (1) |
,
, and
j represent the strength of selective pressure, relative occurrence of transition with respect to transversion, and frequency of codon j, respectively.
Is typically used to detect selection, but its interpretation requires caution (see Results and Discussion). In Equation 1, the instantaneous transition rate is proportional to the frequency of the replaced codon,
j. Other codon models (Muse and Gaut, 1994; Wong and Nielsen, 2004) have been developed in which the instantaneous transition rate is proportional to the frequency of the replaced nucleotide.
Conventional Amino Acid Models
For amino acid models, the instantaneous transition rate from amino acid ai to amino acid aj is represented as follows:
|
| (2) |
aj is the frequency of amino acid aj and saiaj represents the relative speed of replacement from ai to aj during infinitesimal time. The values of the parameter saiaj for various amino acid models have been estimated by parsimony-based (Dayhoff et al., 1978; Jones et al., 1992) and likelihood-based (Adachi and Hasegawa, 1996; Whelan and Goldman, 2001) procedures.
For the sake of convenience of numerical computation, it is typically assumed that the models are time reversible, and this implies that saiaj = sajai . The values for
aj are occasionally estimated from a sequence database, but substantial improvements in model fit can often be realized when the values for
aj are estimated from the particular protein family of interest (Cao et al., 1994). When the frequencies are derived solely from the protein families of interest, we follow convention by adding a "+F" suffix to the name of the amino acid model. A 20x20 instantaneous rate matrix R can be derived from Equation AminoModel. The matrix of the transition probabilities between amino acids during time t is calculated as exp {tR}. For the sake of convenience of mathematical notation, we do not consider any normalizing factor for the rate matrix.
Converting Amino Acid Model to Codon Model by Incorporating Synonymous Substitutions
We consider a new codon model in which the instantaneous transition rate from codons i to j is given as follows:
|
| (3) |
j is the frequency of codon j, and saiaj is obtained from the amino acid model given by Equation 2. We assume that
aj =
k:ak = aj
k. This assumption is easy to reconcile with estimates when the +F option is employed and the observed frequencies are used to estimate parameters
aj and
k. The free parameter
represents the speed of synonymous substitution relative to the various values of saiaj in infinitesimal time. We refer to this codon model as "SK-P1," where P1 implies that there exists one free parameter in the rate matrix. SK-P1 is designed for the conversion of an amino acid model to a codon model. In an amino acid model, all types of amino acid transitions are allowed in infinitesimal time, even if multiple nucleotide substitutions are involved. For the sake of consistency with the amino acid model, we allow instantaneous multiple nucleotide substitution in Equation SK-P1. Multiple-nucleotide substitutions in infinitesimal time is biologically plausible because it can be caused by compensatory mutation, recombination, gene conversion, etc., and its incorporation into the codon model can improve the likelihood score (Whelan and Goldman, 2004).
To measure the strength of selective pressure,
should be compared with the average of all saiaj values; this can be calculated as follows:
|
| (4) |
and dn/ds (see Results and Discussion).
For SK-P1, the transition probabilities between codons can be derived from the transition probabilities for amino acid models. We can show that the transition probabilities from codons i to j are represented as follows (see Appendix):
|
| (5) |
Codon Model Equivalent to Amino Acid Model
We consider another new codon model SK-P0, which is nested in SK-P1. In a fixed amount of time t, the amount of synonymous substitution increases with
in Equation SK-P1. However, the amount of nonsynonymous substitution remains constant because the rate matrix is not normalized. The amount of nonsynonymous substitution is the same as that of amino acid substitution in Equation 3. When synonymous substitutions are completely saturated (
=
), SK-P1 is reduced to SK-P0. Figure 1 provides a pictorial summary of the relationships between amino acid models and SK-P0 and SK-P1.
|
We note that SK-P0 is a codon model equivalent to the amino acid model. When
=
, Equation 5 is reduced to Pij(t) = Paiaj (t)
j/
aj . For the given sequence data D and vector t of branch lengths, let us denote the likelihoods of SK-P0 and the amino acid model as Lc(t| D) and La(t| D), respectively. Assuming that there are T sequences with length n codons, we can show the following relationship (see Appendix):
|
| (6) |
| Results and Discussion |
|---|
|
|
|---|
Information Pertaining to Synonymous Substitutions: Cases of Mammals and Yeast
We analyzed two empirical data sets—12 mammalian mitochondrial (Nikaido et al., 2003) and 106 yeast (Rokas et al., 2003) protein-coding genes. The mitochondrial genomes of 69 mammalian species were downloaded from GenBank (refer to Nikaido et al., 2003 for accession numbers). Except for NADH6, all 12 protein-coding genes were aligned using ClustalW (Chenna et al., 2003) with the default option and then manually edited. The ML method was adopted to estimate the rate parameters and branch lengths for Tree 6 given by Nikado et al. The 106 yeast protein-coding genes were provided by Rokas et al. (2003). For each gene, the ML tree estimated by Ren et al. (2005) was used for the ML estimation of branch lengths and rate parameter
. It is known that mammalian mitochondria evolve fast (Brown et al., 1979) and the yeast taxa share very old divergence times (Dujon, et al, 2004). The purpose of our analysis is to test if divergent sequences still contain informative signals pertaining to synonymous substitutions. In our analysis, we converted the mtREV+F amino acid model (Adachi and Hasegawa, 1996) to SK-P0 and SK-P1 for mammalian mitochondrial data and the WAG+F amino acid model (Whelan and Goldman, 2001) to SK-P0 and SK-P1 for yeast data. Other amino acid models (Dayhoff et al., 1978; Jones et al., 1992) can be converted in a straightforward manner.
Figure 2 shows the importance of synonymous substitutions by comparing the log-likelihood scores for SK-P0 and SK-P1 for mammalian (Fig. 2a) and yeast (Fig. 2b) genes. The log-likelihood scores for SK-P0 are represented on the horizontal axes, whereas the improved log-likelihood scores for SK-P1 are represented on the vertical axes. SK-P1 is reduced to SK-P0 when
=
. Because infinity is located on the boundary of the parameter space, a 50:50 mixture of
20 and
21 distributions could be adopted to verify the significance of twice the log-likelihood difference (Self and Liang, 1987) (2
loglike). The 5% rejection region of this mixed distribution is (2.71,
). All score differences are highly significant (p << 0.001). This result indicates that SK-P0 is an inappropriate model because it ignores synonymous substitutions.
|
Relationship between Synonymous Saturation and Evolutionary Distance
The analysis of mammal and yeast data reveals that synonymous substitutions are very important and should not be ignored. To verify whether this result can be generalized to other protein-coding sequences, we investigate the circumstances under which SK-P1 becomes significantly better than SK-P0. We consider a simple scenario comprising two sequences separated by an evolutionary distance d. The appropriate normalizing factor of Equation 3 is taken into account such that d represents the number of codon substitutions per codon site. To further simplify the investigation, we do not consider the uncertainty of 2
loglike. For various values of d, we calculate the minimum codon length (n) such that E[2
loglike] is located within a 5% rejection region of the mixed distribution as follows.
For codons i and j at the pth site of the two sequences, let us denote the likelihoods for SK-P1 and SK-P0 as f1(xp|
, t,
) and f0(xp| t,
), respectively. Then,
|
|
|
|
|
|
). The interval of (2.71,
) is a 5% rejection region under the 50:50 mixture of
20 and
21 distributions. For the given values of
i,
, and t, the evolutionary distance (d) is calculated as t
i{
i
j
iRij}, where the values of Rij are those obtained from Equation 3.
Figure 3 is obtained by adopting the estimates of
i values and
from COX1 as the true parameters of Equation 3. The estimated
(
, dn/ds) of COX1 is 0.00174. Based on this ratio and the
i values of COX1, 1.0 codon substitutions per codon site correspond to 0.963 synonymous and 0.037 nonsynonymous substitutions per codon site. Although only two sequences are considered and the uncertainty of 2
loglike is not represented, Figure 3 provides a rough insight into the range of the sequence length and the evolutionary distance within which synonymous substitutions should not be ignored. Figure 3 shows that SK-P0 is strongly rejected for moderately long sequence pairs separated by a small evolutionary distance (d). In the analysis of more than two sequences, the superiority of SK-P1 is more pronounced because reconstructed phylogenies typically contain short branches, and SK-P1 offers a significant advantage for such short branches. Moreover, E[2
loglike] increases with the number of available sequences; thus, the probability of SK-P0 being rejected increases. Therefore, we expect that the advantage offered by SK-P1 over SK-P0 is not specific to mammals and yeasts but is applicable to all protein-coding sequences.
|
Effect of Ignoring Synonymous Substitutions
When a protein is subjected to selective pressure, the resulting nonsynonymous substitution is not neutral, and its evolutionary pattern may differ from that of a synonymous substitution. Therefore, considering only nonsynonymous substitutions and ignoring synonymous substitutions may distort the estimations of phylogeny and divergence times. Consistent with previous findings (Anderson and Swofford, 2004), Figure 4 shows that the branch-length estimation can be significantly influenced by model selection. The trees of mammalian COX1 reconstructed under SK-P0 (red) and SK-P1 (blue) are superposed by aligning the longest branch lengths from the root to each tip. The red (blue) bar at the bottom represents 0.1 nonsynonymous substitutions (codon substitution) per codon site. The red tree is identical to the tree reconstructed under the mtREV+F amino acid model because SK-P0 is equivalent to mtREV+F in our analysis of mammalian mitochondrial protein-coding genes. In Figure 4, the significant deviation from proportionality is consistent with the fact that COX1 has been subjected to different selective pressures in different lineages during mammalian evolution (Wu et al., 2000). When the red and blue trees are roughly proportional to each other (e.g., NADH2; data not shown), we expect that the nonsynonymous substitutions have been subjected to roughly constant selective pressures. Even when the two trees are roughly proportional to each other, we note that ignoring synonymous substitutions is unacceptable; this is because SK-P1 is significantly better than SK-P0, as shown in Figure 2.
|
Merging SK-P1 and Conventional Codon Models
One notable feature of SK-P0 and SK-P1 is that they require considerably less computation than conventional 61-dimensional codon models (Goldman and Yang, 1994; Muse and Gaut, 1994). They require the execution of a 20x 20 amino acid rate matrix instead of a 61x 61 codon rate matrix; this is because the transition probabilities between codons can be calculated from the transition probabilities between amino acids (Equation 5).
The conventional modification of codon models (Yang and Nielsen, 1998; Nielsen and Yang, 1998) can be similarly applied to SK-P1. That is, we can allow different lineages of phylogeny or different sites to have different values of parameter
. The heterogeneity of
among lineages and sites can be simultaneously considered, as shown in a previous study (Yang and Nielsen, 2002). Furthermore, we note that any amino acid model can be incorporated into SK-P1. We analyzed the mammalian data by incorporating the saiaj values of the JTT+F and WAG+F amino acid models (Jones et al., 1992; Cao et al., 1994; Whelan and Goldman, 2001) and found that the overall results in Figure 2 do not change (data not shown).
Goldman and Yang's (1994) conventional codon model, presented in Equation 1 (GY94), can be generalized by including the estimates of the saiaj parameters from amino acid-replacement models. We refer to the resulting model as SK-P2, and its instantaneous transition rate from codon type i to codon type j is
|
| (7) |
to nonsynonymous substitutions (Equation 1), SK-P2 assigns a free parameter
to synonymous substitutions. As in the case of SK-P1, the ratio |
|
, 1, or 0 if the replacement from i to j involves one transition, one transversion, or more than one nucleotide substitution, respectively.
Phenotypic information such as the physical properties of amino acids (Goldman and Yang, 1994; Hasegawa, 1998) and the structural stability of entire protein sequences (Robinson et al., 2003) can be incorporated into codon models. In SK-P2, a database-derived amino acid model is incorporated (see the related approaches by Doron-Faigenboim and Pupko, 2007, and Kosiol et al., 2007). Just as phenotypic information may improve codon models, amino acid-replacement models may improve codon models. As a preliminary data analysis, we compared the log-likelihood scores with GY94 and SK-P2. For the values of the SK-P2 parameters saiaj and
j, the mtREV+F amino acid-replacement model was used. There were 11 mammalian mitochondrial genes that showed improved scores with SK-P2 and only one relatively short gene of about 70 codons (ATP8) that had a poor score for SK-P2 (data not shown).
When saiaj
1 for all ai
aj and
= 1/
(
0) in Equation 7, SK-P2 becomes equivalent to GY94 (Equation 1). The amino acid-replacement model with all saiaj
1 corresponds to the "proportional" model of amino acid replacement (Cao et al., 1994). That is, GY94 is SK-P2 in which the proportional model is incorporated. The proportional model is the simplest amino acid model because the database-derived estimates of the saiaj parameters are not needed. Because any amino acid model can be incorporated into SK-P2, its applicability is more flexible than that of GY94. As in Equation 7, Muse and Gaut's (1994) codon model, in which the instantaneous transition rate is proportional to the frequency of replaced nucleotide, can be generalized by incorporating saiaj values; however, it is not possible to exactly equate the SK-P2 model to Muse and Gaut's codon model.
Interpretation of
, dn/ds, and 

Typically,
of Equation 1 is used to represent the strength of selective pressure. Another conventional measure of selective pressure is dn/ds (e.g., Nei and Gojobori, 1986), where dn (ds) represents the number of nonsynonymous (synonymous) substitutions per nonsynonymous (synonymous) site. In this section, we focus on a discussion on dn/ds, which also applies to
and
. The threshold value of dn/ds is usually set to be one. A dn/ds value of less than one is interpreted as a purifying selection, whereas that exceeding one is regarded as a positive selection. However, as noted by Ohta and Basten (1992), sequences may have been subjected to positive selection even when dn/ds is less than one.
Let us assume that the synonymous mutation rate per generation per sequence is µ. Under neutral evolution, the genetic diversity of a synonymous site within a population is represented by 4Nµ, where N is the effective population size, and the fixation rate of a synonymous mutant within the population is equal to the mutation rate per generation (µ) (Kimura, 1983). Similarly, let us assume that the nonsynonymous mutation rate per generation is v. As in the case of synonymous mutation, the genetic diversity and fixation rate of the nonsynonymous substitution are 4Nv and v, respectively, under neutral evolution. For simplicity, we assume that all the third codon sites are synonymous and all the first and second codon sites are nonsynonymous. Then, the number of nonsynonymous sites is twice that of synonymous sites, and dn and ds are v/(2L) and µ/L, respectively, where L is the number of codon sites. The value of dn/ds is one when 2µ = v. However, this is not the condition required by neutral evolution. Both synonymous and nonsynonymous mutants can evolve neutrally irrespective of the relationship between µ and v. Neutrality is the evolutionary process after a mutant originates within a population, and it does not restrict the mutation rates for different types of mutants. This can be more clearly and easily understood when we consider the application of DNA substitution models such as HKY85 (Hasegawa et al., 1985) to noncoding DNA sequences. Although the evolution of noncoding sequences is expected to be almost neutral, transition-type substitutions have a much higher evolutionary rate than transversion-type substitutions. This is due to their differing mutation rates, and not due to positive selection.
McDonald and Kreitman (1991) observed that positive selection is operative on the alcohol dehydrogenase (ADH) sequences from three Drosophila species by classifying the codon substitutions into the following 2 x 2 categories: fixed/polymorphic and synonymous/nonsynonymous substitutions. The number of nonsynonymous and fixed (nonsynonymous and polymorphic) sites is 7 (2). The number of synonymous and fixed (synonymous and polymorphic) sites is 17 (42). If nonsynonymous substitutions are selectively neutral, the proportion of fixed nonsynonymous sites relative to the polymorphic nonsynonymous sites should be equal to that of synonymous sites. However, the former was much larger than the latter in ADH (7/2 > 17/42; p < 0.01). The numbers of fixed and polymorphic nonsynonymous sites are smaller than those of synonymous sites. However, nonsynonymous substitutions fix more quickly than synonymous substitutions, which is indicative of positive selection. Based on McDonald and Kreitman's data, dn/ds = {7/(2L)}/{17/L} = 7/34 < 1. In other words, dn/ds is less than one, although the sequences have been positively selected.
The interpretation of our new measure of
is similar to that of
and dn/ds. When these quantities are larger than one, we can consider the sequences to have undergone positive selection. Because the nonsynonymous mutation rate would be lower than synonymous mutation rate in most situations, the increase in dn relative to ds would be due to positive selection. However, it is hard to determine whether the sequences were under purifying selection even when dn/ds < 1, as can be seen in the case of McDonald and Kreitman's data. Therefore, a simple and safe interpretation of the three measures is "the relative occurrence of nonsynonymous substitution with respect to synonymous substitution"; the three measures are compatible in this sense. Finally, we note that "
" does not indicate strict neutrality when there is no difference between synonymous and nonsynonymous mutation rates. When
, any sequence generated with the SK-P1 or SK-P2 model would be different from that generated by a simple model such as HKY85 (Hasegawa et al., 1985) and noticeably different from intergenic neutral DNA.
| Concluding Remarks |
|---|
|
|
|---|
In this study, we developed new codon models that are derived from amino acid models. By employing these codon models, it is possible to statistically compare amino acid and codon models. Through the analysis of real data and theoretical results, we showed that synonymous substitutions are very informative and substantially improve evolutionary inference, even when the sequences are highly divergent. This indicates that we should be very careful while selecting the appropriate model between the codon and amino acid ones for the analysis of protein-coding sequences. We observed that the SK-P1 model proposed in this study is a good alternative to the amino acid model in the cases wherein synonymous substitutions cannot be ignored. We expect that our procedure to quantify synonymous information will play a critical role in the evolutionary analysis of protein-coding DNA sequences, which are gaining importance in the current genomic era.
| APPENDIX |
|---|
|
|
|---|
Proof of Equation 5
Let us denote the amino acid rate matrix of Equation 2 as Ra:
|
| (8) |
|
| (9) |
i(a) are a pair of eigenvectors and eigenvalues of Ra, and Ua Va = Va Ua = I. The transition probabilities among amino acids can be calculated using the matrix exponential
|
| (10) |
Now, let us denote the 61 x 61 codon rate matrix in Equation 3 as Rc,
|
|
|
| (11) |
We found that the eigenvalues and eigenvectors of Rc are explicitly expressed using those of Ra,
, and the frequencies of amino acids and codons. The eigenvalues (
i(c), 1
i
61) of Rc are
|
|
(i) is the index of the first codon that encodes ai. For example,
(1) =
(2) =
(3) = 1 because the first codon (ATT) encoding Ile is located in the first row and column of Rc.
(4) =
(5) =
(6) =
(7) = 4 because the first codon (GTT) encoding Val is located in the fourth row and column of Rc.
ai (a) is the eigenvalue of Ra corresponding to ai in Equation 2. Raiai is obtained from Ra given by Equation 2. Then, Dc is represented as
|
| (12) |
The elements of the eigenvector u(c)i = (u1,i(c),
, u61,i(c))T corresponding to the eigenvalue
i(c) are expressed as
|
|
|
| (13) |
Likewise, Vc is expressed in terms of Va and the frequencies of amino acids and codons. The elements of row vector viT(c) = (vi,1(c),
, vi,61(c)) in Equation 11 can be shown as
|
|
|
| (14) |
The transition probabilities between codons can be calculated using the matrix exponential
|
| (15) |
|
| (16) |
Remark.—When
=
, SK-P1 is reduced to SK-P0, which is equivalent to the amino acid model. In other words, the amino acid model can be regarded as a codon model in which synonymous substitution is completely saturated. The transition probability of SK-P0 is also derived by integrating the transition probability of SK-P1 over
, assuming that no information for
is available:
|
|
=
.
Proof of Equation 6
We prove Equation 6 for a case of four taxa. The proof is straightforwardly extended for the case of more than four taxa. suppose that we observe four terminal codons p, q, r, and s at the jth site. The unknown ancestral codon of p and q (r and s) is x (y). The branch length from x to p (q) is t1 (t2) and the branch length from y to r (s) is t3 (t4). The branch length between x and y is t5. Let us denote the likelihoods at the jth site for SK-P0 and the amino acid model as Lc(j) and La(j), respectively.
Then, the likelihood at the jth site for SK-P0 is
|
|
The product of all sitewise likelihoods results in Equation 6.
|
|
Remark.—Equation 6 is applicable when the codon frequencies are known and are consistent with the amino acid frequencies. When the data are provided in the form of amino acid sequences, only the amino acid frequencies are known. In this case, we can consider a "flat" Dirichlet distribution for the unknown frequencies of codons that encode the same amino acid. With this Dirichlet distribution, the average of
cij /
acij is 1/nacij , where nacij is the number of codons that encode amino acid acij . Therefore, Equation 6 is transformed into
|
| (17) |
| Acknowledgments |
|---|
|
|
|---|
T.-K.S. was supported by JSPS (EYS B-18770216), and H.K. was supported by JSPS (SR B-16300086). We thank Jeff Thorne, Joe Felsenstein, Paul Lewis, Jack Sullivan, and two anonymous reviewers for providing valuable comments, A. Rokas for providing the sequence data, and Z. Yang for providing the 106 ML trees of yeast.
| References |
|---|
|
|
|---|
-
Abascal F., Zardoya R., Posada D. ProtTest: Selection of best-fit models of protein evolution. Bioinformatics (2005) 21:2104–2105.
Adachi J., Hasegawa M. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. (1996) 42:459–468.[Web of Science][Medline]
Akaike H. A new look at the statistical model identification. IEEE Trans. Autom. Contr. (1974) 19:716–723.[CrossRef]
Anderson F. E., Swofford D. L. Should we be worried about long-branch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol. Phylogenet. Evol. (2004) 33:440–451.[CrossRef][Web of Science][Medline]
Brown W. M., George M., Wilson A. C. Rapid evolution of animal mitochondrial DNA. Proc. Natl. Acad. Sci. USA (1979) 76:1967–1971.
Buckley T. R. Model misspecification and probabilistic tests of topology: Evidence from empirical data sets. Syst. Biol. (2002) 51:509–523.
Cao Y., Adachi J., Janke A., Paabo S., Hasegawa M. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: Instability of a tree based on a single gene. J. Mol. Evol. (1994) 39:519–527.[Web of Science][Medline]
Chenna R., Sugawara H., Koike T., Lopez R., Gibson T. J., Higgins D. G., Thompson J. D. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. (2003) 31:3497–3500.
Dayhoff M. O., Schwartz R. M., Orcutt B. C. A model of evolutionary change in proteins. In: Atlas of protein sequence and structure—Dayhoff M. O., ed. (1978) Washington, DC: National Biomedical Research Foundation. 345–352.
Doron-Faigenboim A., Pupko T. A combined empirical and mechanistic codon model. Mol. Biol. Evol. (2007) 24:388–397.
Dujon B., Sherman D., Fischer G., Durrens P., Casaregola S., Lafontaine I., de Montigny J., Marck C., Neuvéglise C., Talla E., Goffard N., Frangeul L., Aigle M., Anthouard V., Babour A., Barbe V., Barnay S., Blanchin S., Beckerich J.-M., Beyne E., Bleykasten C., Boisramé A., Boyer J., Cattolico L., Confanioleri F., de Daruvar A., Despons L., Fabre E., Fairhead C., Ferry-Dumazet H., Groppi A., Hantraye F., Hennequin C., Jauniaux N., Joyet P., Kachouri R., Kerrest A., Koszul R., Lemaire M., Lesur I., Ma L., Muller H., Nicaud J.-M., Nikolski M., Oztas S., Ozier-Kalogeropoulos O., Pellenz S., Potier S., Richard G.-F., Straub M.-L., Suleau A., Swennen D., Tekaia F., Wésolowski-Louvel M., Westhof E., Wirth B., Zeniou-Meyer M., Zivanovic I., Bolotin-Fukuhara M., Thierry A., Bouchier C., Caudron B., Scarpelli C., Gaillardin C., Weissenbach J., Wincker P., Souciet J.-L. Genome evolution in yeasts. Nature. 430:35–44.
Goldman N., Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. (1994) 11:725–736.[Abstract]
Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.[CrossRef][Web of Science][Medline]
Jones D. T., Taylor W. R., Thornton J. M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. (1992) 8:275–282.
Kimura M. The neutral theory of molecular evolution. (1983) Cambridge, UK: Cambridge University Press.
Kosiol C., Holmes I., Goldman N. An empirical codon model for protein sequence evolution. Mol. Biol. Evol. (2007) 24:1464–1479.
Maynard Smith J., Smith N.H. Synonymous nucleotide divergence: What is saturation? Genetics (1996) 142:1033–1036.[Abstract]
McDonald J. H., Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature (1991) 351:652–654.[CrossRef][Medline]
Muse S. V., Gaut B. S. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. (1994) 11:715–724.[Abstract]
Nei M., Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. (1986) 3:418–426.[Abstract]
Nielsen R., Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics (1998) 148:929–936.
Nikaido M., Cao Y., Harada M., Okada N., Hasegawa M. Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol. Phylogenet. Evol. (2003) 28:276–284.[CrossRef][Web of Science][Medline]
Ohta T., Basten C. J. Gene conversion generates hypervariability at the variable regions of Kallikreins and their inhibitors. Mol. Phylgenet. Evol. (1992) 1:87–90.[CrossRef]
Ren F., Tanaka H., Yang Z. An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst. Biol. (2005) 54:808–818.
Robinson D. M., Jones D. T., Kishino H., Goldman N., Thorne J. L. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol. (2003) 20:1692–1704.
Rokas A., Williams B. L., King N., Carroll S. B. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature (2003) 425:798–804.[CrossRef][Medline]
Schwarz G. Estimating the dimension of a model. Ann. Stat. (1974) 6:461–464.[CrossRef]
Self S. G., Liang K.-L. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. (1987) 82:605–610.[CrossRef][Web of Science]
Seo T.-K., Kishino H., Thorne J. L. Estimating absolute rates of synonymous and nonsynonymous nucleotide substitution in order to characterize natural selection and date species divergences. Mol. Biol. Evol. (2004) 21:1201–1213.
Stuart A., Ord K. Kendall's advanced theory of statistics (1996) Volume 2, 5th. London, UK: Edward Arnold.
Thorne J. L., Kishino H., Painter I. S. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. (1998) 15:1647–1657.[Abstract]
Whelan S., Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. (2001) 18:691–699.
Whelan S., Goldman N. Estimating the frequency of events that cause multiple-nucleotide changes. Genetics (2004) 167:2027–2043.
Wong W. S. W., Nielsen R. Detecting selection in noncoding regions of nucleotide sequences. Genetics (2004) 167:949–958.
Wu W., Schmidt T. R., Goodman M., Grossman L. I. Molecular evolution of cytochrome c oxidase subunit I in primates: Is there coevolution between mitochondrial and nuclear genomes? Mol. Phylogenet. Evol. (2000) 17:294–304.[CrossRef][Web of Science][Medline]
Yang Z., Bielawski J. P. Statistical methods for detecting molecular adaptation? Trends Ecol. Evol. (2000) 15:496–503.[CrossRef][Medline]
Yang Z., Nielsen R. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. (1998) 46:409–418.[CrossRef][Web of Science][Medline]
Yang Z., Nielsen R. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. (2002) 19:908–917.
Yang Z., Nielsen R., Goldman N., Pedersen A. K. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.
Yang Z., Nielsen R., Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. (1998) 15:1600–1611.[Abstract]
This article has been cited by other articles:
![]() |
T.-K. Seo and H. Kishino Statistical Comparison of Nucleotide, Amino Acid, and Codon Substitution Models for Evolutionary Analysis of Protein-Coding Sequences Syst Biol, June 29, 2009; (2009) syp015v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Anisimova and C. Kosiol Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Delport, K. Scheffler, and C. Seoighe Models of coding sequence evolution Brief Bioinform, January 1, 2009; 10(1): 97 - 109. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


























