Skip Navigation

Systematic Biology 2008 57(3):367-377; doi:10.1080/10635150802158670
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Seo, T.-K.
Right arrow Articles by Kishino, H.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Seo, T.-K.
Right arrow Articles by Kishino, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2008 Society of Systematic Biologists

Synonymous Substitutions Substantially Improve Evolutionary Inference from Highly Diverged Proteins

Edited by Paul Lewis

Tae-Kun Seo1 and Hirohisa Kishino2

1 Professional Programme for Agricultural Bioinformatics, Graduate School of Agricultural and Life Sciences, University of Tokyo Tokyo, Japan; E-mail: seo{at}iu.a.u-tokyo.ac.jp (T.-K.S.)
2 Laboratory of Biometrics and Bioinformatics, Graduate School of Agricultural and Life Sciences, University of Tokyo Tokyo, Japan


    Abstract
 Top
 Abstract
 Methods
 Results and Discussion
 Concluding Remarks
 APPENDIX
 Acknowledgments
 References
 
Codon-and amino acid-substitution models are widely used for the evolutionary analysis of protein-coding DNA sequences. Using codon models, the amounts of both nonsynonymous and synonymous DNA substitutions can be estimated. The ratio of these amounts represents the strength of selective pressure. Using amino acid models, the amount of nonsynonymous substitutions is estimated, but that of synonymous substitutions is ignored. Although amino acid models lose any information regarding synonymous substitutions, they explicitly incorporate the information for amino acid replacement, which is empirically derived from databases. It is often presumed that when the protein-coding sequences are highly divergent, synonymous substitutions might be saturated and the evolutionary analysis may be hampered by synonymous noise. However, there exists no quantitative procedure to verify whether synonymous substitutions can be ignored; therefore, amino acid models have been arbitrarily selected. In this study, we investigate the issue of a statistical comparison between codon-and amino acid-substitution models. For this purpose, we propose a new procedure to transform a 20-dimensional amino acid model to a 61-dimensional codon model. This transformation reveals that amino acid models belong to a subset of the codon models and enables us to test whether synonymous substitutions can be ignored by using the likelihood ratio. Our theoretical results and analyses of real data indicate that synonymous substitutions are very informative and substantially improve evolutionary inference, even when the sequences are highly divergent. Therefore, we note that amino acid models should be adopted only after carefully investigating and discarding the possibility that synonymous substitutions can reveal important evolutionary information.

Keywords: Amino acid model; codon model; likelihood-ratio test; model comparison; synonymous saturation

Received April 28, 2007; Revised July 23, 2007; Accepted December 17, 2007


The DNA substitutions of protein-coding sequences are categorized into synonymous and nonsynonymous codon substitutions. Because nonsynonymous substitutions induce amino acid replacement, they may have been subjected to Darwinian selection during the evolutionary process. The relative occurrence of nonsynonymous substitutions with respect to synonymous substitutions is used to represent the strength of selective pressure (Yang and Bielawski, 2000). Whereas codon models (Goldman and Yang, 1994; Muse and Gaut, 1994) consider both synonymous and nonsynonymous DNA substitutions, amino acid models (Dayhoff et al., 1978; Jones et al., 1992; Adachi and Hasegawa, 1996; Whelan and Goldman, 2001) take into account only nonsynonymous amino acid replacement. Although amino acid models lose the information regarding synonymous substitutions, they incorporate the information for amino acid replacement, which is empirically derived from databases. It has been widely accepted that synonymous substitutions may saturate quickly as sequences diverge (e.g., Maynard Smith and Smith, 1996). In this case, the saturation of synonymous substitutions could mask the information provided by nonsynonymous substitutions; thus, amino acid models may be better than codon models. Unfortunately, there has been no quantitative procedure to verify whether synonymous substitutions can be ignored, and amino acid models have been arbitrarily selected.

It is crucial to select an appropriate substitution model for phylogenetic analysis. Adopting an inappropriate model may lead to inconsistent branch length estimates, which can result in an incorrect inference of phylogeny (Anderson and Swofford, 2004). Under the Bayesian framework, incorrect branch length estimates may result in wrong estimations of divergence time (Thorne et al., 1998) or mistaken trend estimations of selective pressure (Seo et al., 2004). Furthermore, it has been noted that the confidence level of a reconstructed tree may be affected by inappropriate model selection (Buckley, 2002). The likelihood-ratio test (Stuart and Ord, 1996) and various model selection criteria such as AIC (Akaike, 1974) and BIC (Schwarz, 1974) can be applied to select appropriate substitution models.

Although the problem of model selection has been individually addressed to a great extent for both codon (Yang and Nielsen, 1998, 2002; Nielsen and Yang, 1998; Yang et al., 2000) and amino acid (Adachi and Hasegawa, 1996; Whelan and Goldman, 2001; Abascal et al., 2005) models, a comparison between these two models has rarely been conducted. In our study, we consider a situation in which protein-coding DNA sequences are provided such that both codon and amino acid models are applicable. The log-likelihood scores for codon and amino acid models cannot be directly compared via conventional model selection criteria (Akaike, 1974; Schwarz, 1974; Stuart and Ord, 1996). This is because the dimensions of their data spaces differ—the dimension of an amino acid space is 20, whereas the standard codon table contains three stop codons and its dimension is 61. Although different codon tables may contain differing numbers of stop codons, we assume that the dimension of the codon space is 61 for the sake of convenience of description. When an amino acid model is adopted, a 61-dimensional data space is transformed to a 20-dimensional one by translating the codons into amino acids. The log-likelihood score corresponding to this action of "dimension reduction" (translation) should be added to that of the amino acid model so that the two models are comparable. In the subsequent section, we discuss the procedure to transform an amino acid model in order to make it comparable with the codon one. We propose two types of codon models that can be derived from an amino acid model. By comparing these two types of codon models, we can verify whether the information provided by synonymous substitutions can be ignored for phylogenetic analysis. We apply our procedure to the analysis of mammalian and yeast sequence data and demonstrate that synonymous substitutions are very informative even when the sequences are highly divergent.


    Methods
 Top
 Abstract
 Methods
 Results and Discussion
 Concluding Remarks
 APPENDIX
 Acknowledgments
 References
 
Conventional Codon Models
For the conventional codon model (Goldman and Yang, 1994), the instantaneous transition rate from codon i to codon j is represented as follows:


Formula 1

(1)
where {omega}, {kappa}, and {pi}j represent the strength of selective pressure, relative occurrence of transition with respect to transversion, and frequency of codon j, respectively. {omega} Is typically used to detect selection, but its interpretation requires caution (see Results and Discussion). In Equation 1, the instantaneous transition rate is proportional to the frequency of the replaced codon, {pi}j. Other codon models (Muse and Gaut, 1994; Wong and Nielsen, 2004) have been developed in which the instantaneous transition rate is proportional to the frequency of the replaced nucleotide.

Conventional Amino Acid Models
For amino acid models, the instantaneous transition rate from amino acid ai to amino acid aj is represented as follows:


Formula 2

(2)
where {pi}aj is the frequency of amino acid aj and saiaj represents the relative speed of replacement from ai to aj during infinitesimal time. The values of the parameter saiaj for various amino acid models have been estimated by parsimony-based (Dayhoff et al., 1978; Jones et al., 1992) and likelihood-based (Adachi and Hasegawa, 1996; Whelan and Goldman, 2001) procedures.

For the sake of convenience of numerical computation, it is typically assumed that the models are time reversible, and this implies that saiaj = sajai . The values for {pi}aj are occasionally estimated from a sequence database, but substantial improvements in model fit can often be realized when the values for {pi}aj are estimated from the particular protein family of interest (Cao et al., 1994). When the frequencies are derived solely from the protein families of interest, we follow convention by adding a "+F" suffix to the name of the amino acid model. A 20x20 instantaneous rate matrix R can be derived from Equation AminoModel. The matrix of the transition probabilities between amino acids during time t is calculated as exp {tR}. For the sake of convenience of mathematical notation, we do not consider any normalizing factor for the rate matrix.

Converting Amino Acid Model to Codon Model by Incorporating Synonymous Substitutions
We consider a new codon model in which the instantaneous transition rate from codons i to j is given as follows:


Formula 3

(3)
where aj is the amino acid that codon j encodes, {pi}j is the frequency of codon j, and saiaj is obtained from the amino acid model given by Equation 2. We assume that {pi}aj = {sum}k:ak = aj {pi}k. This assumption is easy to reconcile with estimates when the +F option is employed and the observed frequencies are used to estimate parameters {pi}aj and {pi}k. The free parameter {rho} represents the speed of synonymous substitution relative to the various values of saiaj in infinitesimal time. We refer to this codon model as "SK-P1," where P1 implies that there exists one free parameter in the rate matrix.

SK-P1 is designed for the conversion of an amino acid model to a codon model. In an amino acid model, all types of amino acid transitions are allowed in infinitesimal time, even if multiple nucleotide substitutions are involved. For the sake of consistency with the amino acid model, we allow instantaneous multiple nucleotide substitution in Equation SK-P1. Multiple-nucleotide substitutions in infinitesimal time is biologically plausible because it can be caused by compensatory mutation, recombination, gene conversion, etc., and its incorporation into the codon model can improve the likelihood score (Whelan and Goldman, 2004).

To measure the strength of selective pressure, {rho} should be compared with the average of all saiaj values; this can be calculated as follows:


Formula 4

(4)
The ratio Formula represents the relative speed of nonsynonymous substitutions with respect to the synonymous ones. When Formula is equal to one, nonsynonymous substitutions occur "on an average" as often as synonymous substitutions. Further, Formula is compatible with the conventional {omega} and dn/ds (see Results and Discussion).

For SK-P1, the transition probabilities between codons can be derived from the transition probabilities for amino acid models. We can show that the transition probabilities from codons i to j are represented as follows (see Appendix):


Formula 5

(5)
The transition probability Paiaj (t) from ai to aj can be obtained by the exponentiation of the amino acid rate matrix given in Equation 2.

Codon Model Equivalent to Amino Acid Model
We consider another new codon model SK-P0, which is nested in SK-P1. In a fixed amount of time t, the amount of synonymous substitution increases with {rho} in Equation SK-P1. However, the amount of nonsynonymous substitution remains constant because the rate matrix is not normalized. The amount of nonsynonymous substitution is the same as that of amino acid substitution in Equation 3. When synonymous substitutions are completely saturated ({rho} = {infty}), SK-P1 is reduced to SK-P0. Figure 1 provides a pictorial summary of the relationships between amino acid models and SK-P0 and SK-P1.


Figure 1
View larger version (42K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 Relationship among amino acid model, SK-P0, and SK-P1. A 20-dimensional amino acid model can be transformed to the 61-dimensional SK-P1 model by incorporating the parameter {rho} of synonymous substitution. When {rho} = {infty}, SK-P1 is reduced to SK-P0, which is equivalent to the amino acid model. Amino acid models can be compared with other 61-dimensional codon models via SK-P0.

 
We note that SK-P0 is a codon model equivalent to the amino acid model. When {rho} = {infty}, Equation 5 is reduced to Pij(t) = Paiaj (t) {pi}j/{pi}aj . For the given sequence data D and vector t of branch lengths, let us denote the likelihoods of SK-P0 and the amino acid model as Lc(t| D) and La(t| D), respectively. Assuming that there are T sequences with length n codons, we can show the following relationship (see Appendix):


Formula 6

(6)
where cij is the codon at the jth site of the ith taxon and acij is the amino acid that cij encodes. From Equation 6, we can see that the maximum likelihood (ML) estimate of t for SK-P0 is identical to that of t for the amino acid model. The product of the second term on the right side of Equation 6 corresponds to the action of "dimension reduction" (translation of codon). Equation 6 shows that the likelihood of SK-P0 is expressed as the likelihood of the amino acid model multiplied by the relative codon frequencies over all taxa and sites. This likelihood representation makes it possible to regard amino acid models as a subset of codon models. By comparing SK-P0 with other codon models, we can indirectly compare the amino acid model with other codon models.


    Results and Discussion
 Top
 Abstract
 Methods
 Results and Discussion
 Concluding Remarks
 APPENDIX
 Acknowledgments
 References
 
Information Pertaining to Synonymous Substitutions: Cases of Mammals and Yeast
We analyzed two empirical data sets—12 mammalian mitochondrial (Nikaido et al., 2003) and 106 yeast (Rokas et al., 2003) protein-coding genes. The mitochondrial genomes of 69 mammalian species were downloaded from GenBank (refer to Nikaido et al., 2003 for accession numbers). Except for NADH6, all 12 protein-coding genes were aligned using ClustalW (Chenna et al., 2003) with the default option and then manually edited. The ML method was adopted to estimate the rate parameters and branch lengths for Tree 6 given by Nikado et al. The 106 yeast protein-coding genes were provided by Rokas et al. (2003). For each gene, the ML tree estimated by Ren et al. (2005) was used for the ML estimation of branch lengths and rate parameter {rho}. It is known that mammalian mitochondria evolve fast (Brown et al., 1979) and the yeast taxa share very old divergence times (Dujon, et al, 2004). The purpose of our analysis is to test if divergent sequences still contain informative signals pertaining to synonymous substitutions. In our analysis, we converted the mtREV+F amino acid model (Adachi and Hasegawa, 1996) to SK-P0 and SK-P1 for mammalian mitochondrial data and the WAG+F amino acid model (Whelan and Goldman, 2001) to SK-P0 and SK-P1 for yeast data. Other amino acid models (Dayhoff et al., 1978; Jones et al., 1992) can be converted in a straightforward manner.

Figure 2 shows the importance of synonymous substitutions by comparing the log-likelihood scores for SK-P0 and SK-P1 for mammalian (Fig. 2a) and yeast (Fig. 2b) genes. The log-likelihood scores for SK-P0 are represented on the horizontal axes, whereas the improved log-likelihood scores for SK-P1 are represented on the vertical axes. SK-P1 is reduced to SK-P0 when {rho} = {infty}. Because infinity is located on the boundary of the parameter space, a 50:50 mixture of {chi}20 and {chi}21 distributions could be adopted to verify the significance of twice the log-likelihood difference (Self and Liang, 1987) (2{Delta}loglike). The 5% rejection region of this mixed distribution is (2.71, {infty}). All score differences are highly significant (p << 0.001). This result indicates that SK-P0 is an inappropriate model because it ignores synonymous substitutions.


Figure 2
View larger version (38K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 Comparison of the log-likelihoods of SK-P0and SK-P1. The horizontal axes represent the log-likelihood scores of SK-P0 for (a) mammalian mitochondrial and (b) yeast genes. The vertical axes represent the differences in the log-likelihood scores of the two models.

 
Relationship between Synonymous Saturation and Evolutionary Distance
The analysis of mammal and yeast data reveals that synonymous substitutions are very important and should not be ignored. To verify whether this result can be generalized to other protein-coding sequences, we investigate the circumstances under which SK-P1 becomes significantly better than SK-P0. We consider a simple scenario comprising two sequences separated by an evolutionary distance d. The appropriate normalizing factor of Equation 3 is taken into account such that d represents the number of codon substitutions per codon site. To further simplify the investigation, we do not consider the uncertainty of 2{Delta}loglike. For various values of d, we calculate the minimum codon length (n) such that E[2{Delta}loglike] is located within a 5% rejection region of the mixed distribution as follows.

For codons i and j at the pth site of the two sequences, let us denote the likelihoods for SK-P1 and SK-P0 as f1(xp| {rho}, t, {pi}) and f0(xp| t, {pi}), respectively. Then,


Formula

where Pij(1)(t) and Pij(0)(t) are from Equation 5 for SK-P1 and SK-P0, respectively. Assuming that SK-P1 holds true, the expectations of sitewise log likelihoods are


Formula

The minimum codon size (n) beyond which SK-P1 is significantly better than SK-P0 can be obtained using the following relationship:


Formula

where 2.71 is the left boundary of (2.71, {infty}). The interval of (2.71, {infty}) is a 5% rejection region under the 50:50 mixture of {chi}20 and {chi}21 distributions. For the given values of {pi}i, {rho}, and t, the evolutionary distance (d) is calculated as t{sum}i{{pi}i {sum}j != iRij}, where the values of Rij are those obtained from Equation 3.

Figure 3 is obtained by adopting the estimates of {pi}i values and {rho} from COX1 as the true parameters of Equation 3. The estimated Formula ({approx} {omega}, dn/ds) of COX1 is 0.00174. Based on this ratio and the {pi}i values of COX1, 1.0 codon substitutions per codon site correspond to 0.963 synonymous and 0.037 nonsynonymous substitutions per codon site. Although only two sequences are considered and the uncertainty of 2{Delta}loglike is not represented, Figure 3 provides a rough insight into the range of the sequence length and the evolutionary distance within which synonymous substitutions should not be ignored. Figure 3 shows that SK-P0 is strongly rejected for moderately long sequence pairs separated by a small evolutionary distance (d). In the analysis of more than two sequences, the superiority of SK-P1 is more pronounced because reconstructed phylogenies typically contain short branches, and SK-P1 offers a significant advantage for such short branches. Moreover, E[2{Delta}loglike] increases with the number of available sequences; thus, the probability of SK-P0 being rejected increases. Therefore, we expect that the advantage offered by SK-P1 over SK-P0 is not specific to mammals and yeasts but is applicable to all protein-coding sequences.


Figure 3
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 Minimum codon size of two sequences beyond which SK-P1 is significantly better than SK-P0. The distance on the horizontal axis represents the number of codon substitutions per codon site for SK-P1 (bottom) and the number of amino acid substitutions per codon site for SK-P1 (top). The vertical axis shows the minimum codon size (n) such that E[2{Delta} loglike] belongs to the 5% critical region of the mixed {chi}2 distribution. The estimates of the codon frequencies and parameter {rho} from COX1 were used for the calculation of the minimum codon size shown on the vertical axis.

 
Effect of Ignoring Synonymous Substitutions
When a protein is subjected to selective pressure, the resulting nonsynonymous substitution is not neutral, and its evolutionary pattern may differ from that of a synonymous substitution. Therefore, considering only nonsynonymous substitutions and ignoring synonymous substitutions may distort the estimations of phylogeny and divergence times. Consistent with previous findings (Anderson and Swofford, 2004), Figure 4 shows that the branch-length estimation can be significantly influenced by model selection. The trees of mammalian COX1 reconstructed under SK-P0 (red) and SK-P1 (blue) are superposed by aligning the longest branch lengths from the root to each tip. The red (blue) bar at the bottom represents 0.1 nonsynonymous substitutions (codon substitution) per codon site. The red tree is identical to the tree reconstructed under the mtREV+F amino acid model because SK-P0 is equivalent to mtREV+F in our analysis of mammalian mitochondrial protein-coding genes. In Figure 4, the significant deviation from proportionality is consistent with the fact that COX1 has been subjected to different selective pressures in different lineages during mammalian evolution (Wu et al., 2000). When the red and blue trees are roughly proportional to each other (e.g., NADH2; data not shown), we expect that the nonsynonymous substitutions have been subjected to roughly constant selective pressures. Even when the two trees are roughly proportional to each other, we note that ignoring synonymous substitutions is unacceptable; this is because SK-P1 is significantly better than SK-P0, as shown in Figure 2.


Figure 4
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 Trees of mammalian COX1 reconstructed with SK-P0 (red) and SK-P1 (blue). The red bar at the bottom represents 0.1 nonsynonymous substitutions per codon site, and the blue bar represents 0.1 codon substitutions per codon site. Two trees are superposed by aligning the longest branch from the root to the tip. The red tree is identical to the tree constructed for the mtREV+F model. The numbered subgroups are (1) Euarchonta, (2) Glires, (3) Scrotifera, (4) Eulipotyphla, (5) Afrotheria, (6) Marsupialia, and (7) Monotremata.

 
Merging SK-P1 and Conventional Codon Models
One notable feature of SK-P0 and SK-P1 is that they require considerably less computation than conventional 61-dimensional codon models (Goldman and Yang, 1994; Muse and Gaut, 1994). They require the execution of a 20x 20 amino acid rate matrix instead of a 61x 61 codon rate matrix; this is because the transition probabilities between codons can be calculated from the transition probabilities between amino acids (Equation 5).

The conventional modification of codon models (Yang and Nielsen, 1998; Nielsen and Yang, 1998) can be similarly applied to SK-P1. That is, we can allow different lineages of phylogeny or different sites to have different values of parameter {rho}. The heterogeneity of {rho} among lineages and sites can be simultaneously considered, as shown in a previous study (Yang and Nielsen, 2002). Furthermore, we note that any amino acid model can be incorporated into SK-P1. We analyzed the mammalian data by incorporating the saiaj values of the JTT+F and WAG+F amino acid models (Jones et al., 1992; Cao et al., 1994; Whelan and Goldman, 2001) and found that the overall results in Figure 2 do not change (data not shown).

Goldman and Yang's (1994) conventional codon model, presented in Equation 1 (GY94), can be generalized by including the estimates of the saiaj parameters from amino acid-replacement models. We refer to the resulting model as SK-P2, and its instantaneous transition rate from codon type i to codon type j is


Formula 7

(7)
Whereas GY94 assigns a free parameter {omega} to nonsynonymous substitutions (Equation 1), SK-P2 assigns a free parameter {rho} to synonymous substitutions. As in the case of SK-P1, the ratio Formula represents the strength of selective pressure. The average of all saiaj values is


Formula

In the above equation, k(i,j) = {kappa}, 1, or 0 if the replacement from i to j involves one transition, one transversion, or more than one nucleotide substitution, respectively.

Phenotypic information such as the physical properties of amino acids (Goldman and Yang, 1994; Hasegawa, 1998) and the structural stability of entire protein sequences (Robinson et al., 2003) can be incorporated into codon models. In SK-P2, a database-derived amino acid model is incorporated (see the related approaches by Doron-Faigenboim and Pupko, 2007, and Kosiol et al., 2007). Just as phenotypic information may improve codon models, amino acid-replacement models may improve codon models. As a preliminary data analysis, we compared the log-likelihood scores with GY94 and SK-P2. For the values of the SK-P2 parameters saiaj and {pi}j, the mtREV+F amino acid-replacement model was used. There were 11 mammalian mitochondrial genes that showed improved scores with SK-P2 and only one relatively short gene of about 70 codons (ATP8) that had a poor score for SK-P2 (data not shown).

When saiaj {equiv} 1 for all ai != aj and {rho} = 1/{omega} ({omega} != 0) in Equation 7, SK-P2 becomes equivalent to GY94 (Equation 1). The amino acid-replacement model with all saiaj {equiv} 1 corresponds to the "proportional" model of amino acid replacement (Cao et al., 1994). That is, GY94 is SK-P2 in which the proportional model is incorporated. The proportional model is the simplest amino acid model because the database-derived estimates of the saiaj parameters are not needed. Because any amino acid model can be incorporated into SK-P2, its applicability is more flexible than that of GY94. As in Equation 7, Muse and Gaut's (1994) codon model, in which the instantaneous transition rate is proportional to the frequency of replaced nucleotide, can be generalized by incorporating saiaj values; however, it is not possible to exactly equate the SK-P2 model to Muse and Gaut's codon model.

Interpretation of {omega}, dn/ds, and Formula{rho}
Typically, {omega} of Equation 1 is used to represent the strength of selective pressure. Another conventional measure of selective pressure is dn/ds (e.g., Nei and Gojobori, 1986), where dn (ds) represents the number of nonsynonymous (synonymous) substitutions per nonsynonymous (synonymous) site. In this section, we focus on a discussion on dn/ds, which also applies to {omega} and Formula . The threshold value of dn/ds is usually set to be one. A dn/ds value of less than one is interpreted as a purifying selection, whereas that exceeding one is regarded as a positive selection. However, as noted by Ohta and Basten (1992), sequences may have been subjected to positive selection even when dn/ds is less than one.

Let us assume that the synonymous mutation rate per generation per sequence is µ. Under neutral evolution, the genetic diversity of a synonymous site within a population is represented by 4Nµ, where N is the effective population size, and the fixation rate of a synonymous mutant within the population is equal to the mutation rate per generation (µ) (Kimura, 1983). Similarly, let us assume that the nonsynonymous mutation rate per generation is v. As in the case of synonymous mutation, the genetic diversity and fixation rate of the nonsynonymous substitution are 4Nv and v, respectively, under neutral evolution. For simplicity, we assume that all the third codon sites are synonymous and all the first and second codon sites are nonsynonymous. Then, the number of nonsynonymous sites is twice that of synonymous sites, and dn and ds are v/(2L) and µ/L, respectively, where L is the number of codon sites. The value of dn/ds is one when 2µ = v. However, this is not the condition required by neutral evolution. Both synonymous and nonsynonymous mutants can evolve neutrally irrespective of the relationship between µ and v. Neutrality is the evolutionary process after a mutant originates within a population, and it does not restrict the mutation rates for different types of mutants. This can be more clearly and easily understood when we consider the application of DNA substitution models such as HKY85 (Hasegawa et al., 1985) to noncoding DNA sequences. Although the evolution of noncoding sequences is expected to be almost neutral, transition-type substitutions have a much higher evolutionary rate than transversion-type substitutions. This is due to their differing mutation rates, and not due to positive selection.

McDonald and Kreitman (1991) observed that positive selection is operative on the alcohol dehydrogenase (ADH) sequences from three Drosophila species by classifying the codon substitutions into the following 2 x 2 categories: fixed/polymorphic and synonymous/nonsynonymous substitutions. The number of nonsynonymous and fixed (nonsynonymous and polymorphic) sites is 7 (2). The number of synonymous and fixed (synonymous and polymorphic) sites is 17 (42). If nonsynonymous substitutions are selectively neutral, the proportion of fixed nonsynonymous sites relative to the polymorphic nonsynonymous sites should be equal to that of synonymous sites. However, the former was much larger than the latter in ADH (7/2 > 17/42; p < 0.01). The numbers of fixed and polymorphic nonsynonymous sites are smaller than those of synonymous sites. However, nonsynonymous substitutions fix more quickly than synonymous substitutions, which is indicative of positive selection. Based on McDonald and Kreitman's data, dn/ds = {7/(2L)}/{17/L} = 7/34 < 1. In other words, dn/ds is less than one, although the sequences have been positively selected.

The interpretation of our new measure of Formula is similar to that of {omega} and dn/ds. When these quantities are larger than one, we can consider the sequences to have undergone positive selection. Because the nonsynonymous mutation rate would be lower than synonymous mutation rate in most situations, the increase in dn relative to ds would be due to positive selection. However, it is hard to determine whether the sequences were under purifying selection even when dn/ds < 1, as can be seen in the case of McDonald and Kreitman's data. Therefore, a simple and safe interpretation of the three measures is "the relative occurrence of nonsynonymous substitution with respect to synonymous substitution"; the three measures are compatible in this sense. Finally, we note that "Formula " does not indicate strict neutrality when there is no difference between synonymous and nonsynonymous mutation rates. When Formula , any sequence generated with the SK-P1 or SK-P2 model would be different from that generated by a simple model such as HKY85 (Hasegawa et al., 1985) and noticeably different from intergenic neutral DNA.


    Concluding Remarks
 Top
 Abstract
 Methods
 Results and Discussion
 Concluding Remarks
 APPENDIX
 Acknowledgments
 References
 
In this study, we developed new codon models that are derived from amino acid models. By employing these codon models, it is possible to statistically compare amino acid and codon models. Through the analysis of real data and theoretical results, we showed that synonymous substitutions are very informative and substantially improve evolutionary inference, even when the sequences are highly divergent. This indicates that we should be very careful while selecting the appropriate model between the codon and amino acid ones for the analysis of protein-coding sequences. We observed that the SK-P1 model proposed in this study is a good alternative to the amino acid model in the cases wherein synonymous substitutions cannot be ignored. We expect that our procedure to quantify synonymous information will play a critical role in the evolutionary analysis of protein-coding DNA sequences, which are gaining importance in the current genomic era.


    APPENDIX
 Top
 Abstract
 Methods
 Results and Discussion
 Concluding Remarks
 APPENDIX
 Acknowledgments
 References
 
Proof of Equation 5
Let us denote the amino acid rate matrix of Equation 2 as Ra:


Formula 8

(8)
Ra is diagonalized as follows.


Formula 9

(9)
where ui(a) and {lambda}i(a) are a pair of eigenvectors and eigenvalues of Ra, and Ua Va = Va Ua = I. The transition probabilities among amino acids can be calculated using the matrix exponential


Formula 10

(10)

Now, let us denote the 61 x 61 codon rate matrix in Equation 3 as Rc,


Formula

We rearrange the codon order in Rc so that it is consistent with the amino acid order in Ra. For example, if the first and second amino acids in Ra are isoleucine (Ile) and valine (Val), respectively, we place three codons (ATT, ATC, and ATA) that encode Ile in the first part of the rows and columns in Rc and thereafter place four codons (GTT, GTC, GTA, and GTG) that encode Val. The codon order that encodes the same amino acid can be arbitrarily determined. For convenience, in the following mathematical notation, the integer and the name of the amino acid are interchangeably used as an index for the eigensystems of Ra. For example, u1(a) and uIle(a) are interchangeably used. Similar to Equation 9, Rc is diagonalized as follows:


Formula 11

(11)
where Uc Vc = Vc Uc = I.

We found that the eigenvalues and eigenvectors of Rc are explicitly expressed using those of Ra, {rho}, and the frequencies of amino acids and codons. The eigenvalues ({lambda}i(c), 1 ≤ i ≤ 61) of Rc are


Formula

where {gamma}(i) is the index of the first codon that encodes ai. For example, {gamma}(1) = {gamma}(2) = {gamma}(3) = 1 because the first codon (ATT) encoding Ile is located in the first row and column of Rc. {gamma}(4) = {gamma}(5) = {gamma}(6) = {gamma}(7) = 4 because the first codon (GTT) encoding Val is located in the fourth row and column of Rc. {lambda}ai (a) is the eigenvalue of Ra corresponding to ai in Equation 2. Raiai is obtained from Ra given by Equation 2. Then, Dc is represented as


Formula 12

(12)

The elements of the eigenvector u(c)i = (u1,i(c), c, u61,i(c))T corresponding to the eigenvalue {lambda}i(c) are expressed as


Formula

where uaj,ai (a) is the element of uai (a) in Equation 9. Then, Uc is


Formula 13

(13)

Likewise, Vc is expressed in terms of Va and the frequencies of amino acids and codons. The elements of row vector viT(c) = (vi,1(c), c, vi,61(c)) in Equation 11 can be shown as


Formula

where vaj,ai (a) is the element of vai (a) in Equation 11. Then, Vc is


Formula 14

(14)

The transition probabilities between codons can be calculated using the matrix exponential


Formula 15

(15)
From Equations 10, 12, 13, 14, and 15, it can be shown that


Formula 16

(16)
where Paiaj (t) is the transition probability from amino acids ai to aj by Equation 10.

Remark.—When {rho} = {infty}, SK-P1 is reduced to SK-P0, which is equivalent to the amino acid model. In other words, the amino acid model can be regarded as a codon model in which synonymous substitution is completely saturated. The transition probability of SK-P0 is also derived by integrating the transition probability of SK-P1 over {rho}, assuming that no information for {rho} is available:


Formula

where P'ij(t) and Pij(t) are the transition probabilities with SK-P0 and SK-P1, respectively. The result of this reduction is the same as that of the reduction with {rho} = {infty}.

Proof of Equation 6
We prove Equation 6 for a case of four taxa. The proof is straightforwardly extended for the case of more than four taxa. suppose that we observe four terminal codons p, q, r, and s at the jth site. The unknown ancestral codon of p and q (r and s) is x (y). The branch length from x to p (q) is t1 (t2) and the branch length from y to r (s) is t3 (t4). The branch length between x and y is t5. Let us denote the likelihoods at the jth site for SK-P0 and the amino acid model as Lc(j) and La(j), respectively.

Then, the likelihood at the jth site for SK-P0 is


Formula

The product of all sitewise likelihoods results in Equation 6.


Formula

Remark.—Equation 6 is applicable when the codon frequencies are known and are consistent with the amino acid frequencies. When the data are provided in the form of amino acid sequences, only the amino acid frequencies are known. In this case, we can consider a "flat" Dirichlet distribution for the unknown frequencies of codons that encode the same amino acid. With this Dirichlet distribution, the average of {pi}cij /{pi}acij is 1/nacij , where nacij is the number of codons that encode amino acid acij . Therefore, Equation 6 is transformed into


Formula 17

(17)
When the codon frequencies are known, the log-likelihood scores derived from Equations 6 and 17 can be compared, and the significance of the log-likelihood difference can be tested by resampling sitewise log-likelihood scores. Through this comparison, we can test whether the codon frequencies are significantly informative.


    Acknowledgments
 Top
 Abstract
 Methods
 Results and Discussion
 Concluding Remarks
 APPENDIX
 Acknowledgments
 References
 
T.-K.S. was supported by JSPS (EYS B-18770216), and H.K. was supported by JSPS (SR B-16300086). We thank Jeff Thorne, Joe Felsenstein, Paul Lewis, Jack Sullivan, and two anonymous reviewers for providing valuable comments, A. Rokas for providing the sequence data, and Z. Yang for providing the 106 ML trees of yeast.


    References
 Top
 Abstract
 Methods
 Results and Discussion
 Concluding Remarks
 APPENDIX
 Acknowledgments
 References
 

    Abascal F., Zardoya R., Posada D. ProtTest: Selection of best-fit models of protein evolution. Bioinformatics (2005) 21:2104–2105.[Abstract/Free Full Text]

    Adachi J., Hasegawa M. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. (1996) 42:459–468.[Web of Science][Medline]

    Akaike H. A new look at the statistical model identification. IEEE Trans. Autom. Contr. (1974) 19:716–723.[CrossRef]

    Anderson F. E., Swofford D. L. Should we be worried about long-branch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol. Phylogenet. Evol. (2004) 33:440–451.[CrossRef][Web of Science][Medline]

    Brown W. M., George M., Wilson A. C. Rapid evolution of animal mitochondrial DNA. Proc. Natl. Acad. Sci. USA (1979) 76:1967–1971.[Abstract/Free Full Text]

    Buckley T. R. Model misspecification and probabilistic tests of topology: Evidence from empirical data sets. Syst. Biol. (2002) 51:509–523.[Abstract/Free Full Text]

    Cao Y., Adachi J., Janke A., Paabo S., Hasegawa M. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: Instability of a tree based on a single gene. J. Mol. Evol. (1994) 39:519–527.[Web of Science][Medline]

    Chenna R., Sugawara H., Koike T., Lopez R., Gibson T. J., Higgins D. G., Thompson J. D. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. (2003) 31:3497–3500.[Abstract/Free Full Text]

    Dayhoff M. O., Schwartz R. M., Orcutt B. C. A model of evolutionary change in proteins. In: Atlas of protein sequence and structure—Dayhoff M. O., ed. (1978) Washington, DC: National Biomedical Research Foundation. 345–352.

    Doron-Faigenboim A., Pupko T. A combined empirical and mechanistic codon model. Mol. Biol. Evol. (2007) 24:388–397.[Abstract/Free Full Text]

    Dujon B., Sherman D., Fischer G., Durrens P., Casaregola S., Lafontaine I., de Montigny J., Marck C., Neuvéglise C., Talla E., Goffard N., Frangeul L., Aigle M., Anthouard V., Babour A., Barbe V., Barnay S., Blanchin S., Beckerich J.-M., Beyne E., Bleykasten C., Boisramé A., Boyer J., Cattolico L., Confanioleri F., de Daruvar A., Despons L., Fabre E., Fairhead C., Ferry-Dumazet H., Groppi A., Hantraye F., Hennequin C., Jauniaux N., Joyet P., Kachouri R., Kerrest A., Koszul R., Lemaire M., Lesur I., Ma L., Muller H., Nicaud J.-M., Nikolski M., Oztas S., Ozier-Kalogeropoulos O., Pellenz S., Potier S., Richard G.-F., Straub M.-L., Suleau A., Swennen D., Tekaia F., Wésolowski-Louvel M., Westhof E., Wirth B., Zeniou-Meyer M., Zivanovic I., Bolotin-Fukuhara M., Thierry A., Bouchier C., Caudron B., Scarpelli C., Gaillardin C., Weissenbach J., Wincker P., Souciet J.-L. Genome evolution in yeasts. Nature. 430:35–44.

    Goldman N., Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. (1994) 11:725–736.[Abstract]

    Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.[CrossRef][Web of Science][Medline]

    Jones D. T., Taylor W. R., Thornton J. M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. (1992) 8:275–282.[Abstract/Free Full Text]

    Kimura M. The neutral theory of molecular evolution. (1983) Cambridge, UK: Cambridge University Press.

    Kosiol C., Holmes I., Goldman N. An empirical codon model for protein sequence evolution. Mol. Biol. Evol. (2007) 24:1464–1479.[Abstract/Free Full Text]

    Maynard Smith J., Smith N.H. Synonymous nucleotide divergence: What is saturation? Genetics (1996) 142:1033–1036.[Abstract]

    McDonald J. H., Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature (1991) 351:652–654.[CrossRef][Medline]

    Muse S. V., Gaut B. S. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol. Biol. Evol. (1994) 11:715–724.[Abstract]

    Nei M., Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. (1986) 3:418–426.[Abstract]

    Nielsen R., Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics (1998) 148:929–936.[Abstract/Free Full Text]

    Nikaido M., Cao Y., Harada M., Okada N., Hasegawa M. Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol. Phylogenet. Evol. (2003) 28:276–284.[CrossRef][Web of Science][Medline]

    Ohta T., Basten C. J. Gene conversion generates hypervariability at the variable regions of Kallikreins and their inhibitors. Mol. Phylgenet. Evol. (1992) 1:87–90.[CrossRef]

    Ren F., Tanaka H., Yang Z. An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst. Biol. (2005) 54:808–818.[Abstract/Free Full Text]

    Robinson D. M., Jones D. T., Kishino H., Goldman N., Thorne J. L. Protein evolution with dependence among codons due to tertiary structure. Mol. Biol. Evol. (2003) 20:1692–1704.[Abstract/Free Full Text]

    Rokas A., Williams B. L., King N., Carroll S. B. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature (2003) 425:798–804.[CrossRef][Medline]

    Schwarz G. Estimating the dimension of a model. Ann. Stat. (1974) 6:461–464.[CrossRef]

    Self S. G., Liang K.-L. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. (1987) 82:605–610.[CrossRef][Web of Science]

    Seo T.-K., Kishino H., Thorne J. L. Estimating absolute rates of synonymous and nonsynonymous nucleotide substitution in order to characterize natural selection and date species divergences. Mol. Biol. Evol. (2004) 21:1201–1213.[Abstract/Free Full Text]

    Stuart A., Ord K. Kendall's advanced theory of statistics (1996) Volume 2, 5th. London, UK: Edward Arnold.

    Thorne J. L., Kishino H., Painter I. S. Estimating the rate of evolution of the rate of molecular evolution. Mol. Biol. Evol. (1998) 15:1647–1657.[Abstract]

    Whelan S., Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. (2001) 18:691–699.[Abstract/Free Full Text]

    Whelan S., Goldman N. Estimating the frequency of events that cause multiple-nucleotide changes. Genetics (2004) 167:2027–2043.[Abstract/Free Full Text]

    Wong W. S. W., Nielsen R. Detecting selection in noncoding regions of nucleotide sequences. Genetics (2004) 167:949–958.[Abstract/Free Full Text]

    Wu W., Schmidt T. R., Goodman M., Grossman L. I. Molecular evolution of cytochrome c oxidase subunit I in primates: Is there coevolution between mitochondrial and nuclear genomes? Mol. Phylogenet. Evol. (2000) 17:294–304.[CrossRef][Web of Science][Medline]

    Yang Z., Bielawski J. P. Statistical methods for detecting molecular adaptation? Trends Ecol. Evol. (2000) 15:496–503.[CrossRef][Medline]

    Yang Z., Nielsen R. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. (1998) 46:409–418.[CrossRef][Web of Science][Medline]

    Yang Z., Nielsen R. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. (2002) 19:908–917.[Abstract/Free Full Text]

    Yang Z., Nielsen R., Goldman N., Pedersen A. K. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.[Abstract/Free Full Text]

    Yang Z., Nielsen R., Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. (1998) 15:1600–1611.[Abstract]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
T.-K. Seo and H. Kishino
Statistical Comparison of Nucleotide, Amino Acid, and Codon Substitution Models for Evolutionary Analysis of Protein-Coding Sequences
Syst Biol, June 29, 2009; (2009) syp015v1.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
M. Anisimova and C. Kosiol
Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models
Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
W. Delport, K. Scheffler, and C. Seoighe
Models of coding sequence evolution
Brief Bioinform, January 1, 2009; 10(1): 97 - 109.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Seo, T.-K.
Right arrow Articles by Kishino, H.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Seo, T.-K.
Right arrow Articles by Kishino, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?