© 2007 Society of Systematic Biologists
Base-Compositional Heterogeneity in the RAG1 Locus among Didelphid Marsupials: Implications for Phylogenetic Inference and the Evolution of GC Content
Edited by Mark Hafner: Associate Editor
1 Bell Museum of Natural History and Department of Ecology, Evolution, and Behavior, University of Minnesota St. Paul, Minnesota, 55108, USA E-mail: jansa003{at}umn.edu (S.A.J.)
2 Department of Mammalogy, American Museum of Natural History New York, New York, 10024, USA
| Abstract |
|---|
|
|
|---|
Although theoretical studies have suggested that base-compositional heterogeneity can adversely affect phylogenetic reconstruction, only a few empirical examples of this phenomenon, mostly among ancient lineages (with divergence dates > 100 Mya), have been reported. In the course of our phylogenetic research on the New World marsupial family Didelphidae, we sequenced 2790 bp of the RAG1 exon from exemplar species of most extant genera. Phylogenetic analysis of these sequences recovered an anomalous node consisting of two clades previously shown to be distantly related based on analyses of other molecular data. These two clades show significantly increased GC content at RAG1 third codon positions, and the resulting convergence in base composition is strong enough to overwhelm phylogenetic signal from other genes (and morphology) in most analyses of concatenated datasets. This base-compositional convergence occurred relatively recently (over tens rather than hundreds of millions of years), and the affected gene region is still in a state of evolutionary disequilibrium. Both mutation rate and substitution rate are higher in GC-rich didelphid taxa, observations consistent with RAG1 sequences having experienced a higher rate of recombination in the convergent lineages.
Keywords: Biased gene conversion; Didelphidae; nucleotide composition; phylogeny; RAG1; recombination
Received May 18, 2006; Revised July 11, 2006; Accepted September 5, 2006
Base-compositional heterogeneity, as indicated by significant taxonomic differences in GC (versus AT) content among homologous sequences, is now widely recognized as a potential problem for phylogenetic analysis (e.g., by Loomis and Smith, 1990; Hasegawa and Hashimoto, 1993; Lake, 1994; Lockhart et al., 1994; Foster and Hickey, 1999; Mooers and Holmes, 2000; Tarrío et al., 2002; Jeffroy et al., 2006; Barrowclough et al., 2006). Both simulation studies (Lake, 1994; Conant and Lewis, 2001) and analyses of biological data sets (Lockhart et al., 1994; Chang and Campbell, 2000; Tarrío et al., 2002) suggest that commonly used phylogenetic methods may cluster taxa based on GC content rather than evolutionary history when similar base composition has evolved convergently in unrelated clades. Despite considerable interest in base-compositional convergence, only a few empirical examples have been reported in the literature (Conant and Lewis, 2001). The most compelling of these involve genes in photosynthetic organelles (Lockhart et al., 1992, 1994), phyla of bacteria (Galtier and Gouy, 1998), major clades of animals (Foster and Hickey, 1999), and ancient lineages of vertebrates (Chang and Campbell, 2000; Lockhart et al., 1994). Convincing examples of base-compositional convergence at lower taxonomic levels are seldom reported and few (Tarrío et al., 2000, 2001) have been analyzed in detail.
Here we report an exceptionally clear case of base-compositional convergence among New World marsupial mammals of the family Didelphidae, commonly known as opossums. Didelphids comprise about 90 species in 18 genera that collectively range from Patagonia to Canada and inhabit a wide variety of terrestrial habitats (Nowak, 1999; Gardner, 2005; Voss et al., 2005). Recent phylogenetic studies based on nuclear genes and morphology have resulted in highly congruent estimates of didelphid relationships (Jansa and Voss, 2000, 2005; Voss and Jansa, 2003; Voss et al., 2005; Steiner et al. 2005; Jansa et al., 2006). In particular, sequence data from the protein-coding genes interphotoreceptor retinoid-binding protein (IRBP) and dentin matrix protein 1 (DMP1); the noncoding nuclear transthyretin intron 1 (TTR); the mitochondrial 12S rDNA gene; and morphological characters all yield remarkably congruent phylogenetic signal (Steiner et al., 2005; Jansa et al., 2006). Certain relationships, however, have proven difficult to resolve with existing data sets.
To address this problem, we sequenced the recombination activating gene (RAG1) from 41 species of didelphids. RAG1 is a single-copy, intron-free, protein-coding nuclear locus that has been used for phylogenetic inference in a variety of other vertebrate groups (e.g., Groth and Barrowclough, 1999; Barker et al., 2002; Waddell and Shelley, 2003; Mahon and Carpenter, 2004; San Mauro et al., 2005). Because previous analyses of RAG1 sequence data resulted in highly resolved and robustly supported phylogenetic relationships at multiple hierarchical levels, we anticipated a straightforward analysis of our didelphid data. Instead, we discovered a striking example of convergent base composition that affords an unusual opportunity to explore the application of commonly suggested methodological remedies and to examine alternative evolutionary explanations.
| Materials And Methods |
|---|
|
|
|---|
DNA Amplification and Sequencing
We amplified and sequenced a portion of the RAG1 gene corresponding to positions 240 to 3100 of Monodelphis domestica(Genbank accession U51897 [GenBank] ) using the primers listed in Table 1. For most specimens, we amplified three 1000-to 1200-bp fragments directly from genomic DNA and used each of these amplicons in a second nested PCR to create shorter (500-to 600-bp) fragments for sequencing. For some specimens with degraded DNA, however, these smaller fragments were amplified directly from genomic DNA. The initial amplification from genomic DNA was performed as 20 µ l reactions using Ampli-Taq Gold DNA polymerase (Perkin–Elmer, Boston, MA) and reaction conditions as previously described for our work with IRBP (Jansa and Voss, 2000). Reamplification reactions were performed as 30-to 40-µ l reactions with Taq DNA polymerase (Promega, Madison, WI). The resulting PCR products were purified and prepared for sequencing using either a Qiaquick PCR purification kit (QIAGEN, Santa Clarita, CA) or EXO-SAP enzymatic digestion (USB, Cleveland, OH). PCR products were sequenced in both directions using amplification primers and dye-terminator chemistry (BigDye Cycle Sequencing; Applied Biosystems), and sequencing reactions were run on an ABI 3700 automated sequencer. Base-calling ambiguities were resolved either by choosing the base on the cleanest strand or using the appropriate IUB ambiguity code if both strands showed the same ambiguity. All sequences, along with their specimen voucher numbers, have been deposited in GenBank with accession numbers DQ865883 [GenBank] to DQ865923. The data matrix and associated trees have been deposited on TreeBase with accession numbers S1579 and M2841.
|
Sequence Analysis
We used two approaches to test our RAG1 sequences for departures from base-compositional stationarity across taxa. First, we evaluated each taxon for differences from the average base composition across taxa for each codon position using a
2 goodness-of-fit test (Bonferroni-corrected
= 0.001). Second, we calculated the disparity index (ID) for each codon position (Kumar and Gadagkar, 2001) using MEGA version 3.0 (Kumar et al., 2004). This test compares the observed compositional difference between a given pair of sequences (Dc) with the compositional difference (Nd) that would be expected if the sequences were evolving under equilibrium (ID = Dc – Nd). Two sequences that are perfectly homogeneous will have an ID value of zero; consequently, higher ID values reflect increasingly different base composition between sequence pairs. The significance of a given ID value is assessed by comparison with a Monte Carlo–generated distribution of expected values assuming a model of equilibrium base substitution (Kumar and Gadagkar, 2001). Maximum parsimony (MP) analyses were performed using heuristic search algorithms implemented by PAUP* 4.0b10 (Swofford, 2003) with all characters treated as unordered and equally weighted. Heuristic searches were performed with 1000 replicates of random stepwise taxon addition and tree bisection-reconnection (TBR) branch swapping. For maximum likelihood (ML) analyses, we determined the best-fit model of nucleotide substitution for various partitions of the molecular data using the Akaike information criterion (AIC) as implemented in ModelTest 3.6 (Posada and Crandall, 1998). We evaluated the fit of various models to each gene (IRBP, DMP1, and RAG1) independently and to the combined molecular dataset. In addition, we partitioned the RAG1 data into third codon positions and first + second codon positions, and we evaluated the best-fit model for each of these partitions. Finally, we employed a likelihood-ratio test to determine whether a molecular clock could be applied to the sequence data. Following model evaluation and selection, we obtained the maximum likelihood tree using a heuristic search in which parameter values were fixed according to the best-fit model, and a neighbor-joining tree was used as starting point for TBR branch swapping.
We assessed nodal support using nonparametric bootstrapping (Felsenstein, 1985) for the MP and ML analysis and Bremer support (Bremer, 1988, 1994) for the MP analysis. Bootstrap values for the MP analysis were calculated from 1000pseudoreplicated data sets, each of which was analyzed heuristically using 10 random-addition replicates, TBR branch swapping, and no more than 10 trees saved per pseudoreplicate. The ML bootstrap analysis was based on 1000 pseudoreplicated data sets analyzed with the fast stepwise option in PAUP* using the same best-fit model employed in the original ML tree search. We calculated partitioned Bremer support (PBS; Baker and DeSalle, 1997) to quantify the relative support and conflict provided by different data partitions for individual nodes in the combined-data (IRBP+DMP1+RAG1+morphology) tree. All Bremer values were calculated using constraint trees and command files generated by TreeRot (Sorenson, 1999). We used likelihood-based Shimodaira-Hasegawa tests (SH; Shimodaira and Hasegawa, 1999) to evaluate the fit of a given data set to alternative topologies. Specifically, we addressed whether the RAG1 data could reject the tree derived from IRBP+DMP1 data set and vice versa. All SH tests were performed using 1000 RELL replicates allowing the parameters for the relevant model of nucleotide substitution to be estimated.
Finally, we applied two phylogeny-reconstruction methods specifically designed to compensate for base-compositional heterogeneity to the RAG1 data set. First, we calculated LogDet-corrected distances (Lockhart et al., 1994) and inferred a neighbor-joining tree from the resulting matrix using PAUP*. Second, we evaluated a likelihood-based method that allows for variation in GC content across lineages, as implemented in the software package nhml (Galtier and Gouy, 1998). Unfortunately, this method (hereafter referred to as GG98) has been reported to search only a small fraction of available tree space when more than a few taxa are analyzed (Galtier and Gouy, 1998). Given this limitation, we tested for convergence by performing two separate tree searches. In the first, we used the tree derived from all data except RAG1 as the starting topology for branch swapping; in the second, we used the tree derived from RAG1 sequences alone as the initial topology.
To examine evolutionary rates for third positions of RAG1, we used a penalized likelihood approach as implemented in r8s version 1.50 (Sanderson, 2002). To provide the tree and branch lengths for this analysis, we optimized branch lengths for RAG1 third positions on the topology resulting from likelihood analysis of all data except RAG1 third positions. We established the best value for the smoothing parameter using the cross-validataion procedure with smoothing values ranging from 10–10 to 100. Searches using the optimal smoothing value (S = 0.001) employed the truncated Newton algorithm with 10 random restarts and 10 subsequent perturbations. Estimates of per-branch rates were scaled against time by setting the origin of didelphines to 40 million years based on the results published by Steiner et al. (2005).
| Results |
|---|
|
|
|---|
Phylogenetic Analysis
The results of our previous didelphid phylogenetic research based on the interphotoreceptor retinoid binding protein (IRBP), dentin matrix protein 1 (DMP1), and morphology have been reported elsewhere (Jansa and Voss, 2000, 2005; Voss and Jansa, 2003; Jansa et al., 2006) and require only summary description here. Briefly, although trees resulting from separate analyses of these data sets show some topological differences, none are significant when assessed by comparing bootstrap support values node-by-node or by performing Shimodaira-Hasegawa tests. Therefore, we combined the IRBP, DMP1, and morphological data in a single matrix and analyzed them simultaneously using parsimony. The resulting strict consensus of four minimal-length trees (Fig. 1A) is highly resolved and includes several well-supported groups. In particular, clades C and H are recovered with strong (> 90%) bootstrap support in our combined-data analysis. Both of these groups were previously supported by our separate analyses of IRBP (e.g., Jansa and Voss, 2000) and DMP1 (Jansa et al., 2006), and they have also been recovered by scnDNA hybridization (Kirsch and Palma, 1995) and by Bayesian analyses of 12S rRNA and TTR intron sequences (Steiner et al., 2005).
|
In contrast, clades C and H were not recovered by parsimony analysis of RAG1 sequences (Fig. 1B) because clades B and I form a group. This anomalous node (B+I) is strongly supported by RAG1 (parsimony bootstrap support = 97%) and does not appear to have resulted simply from long-branch attraction artifacts that are known to mislead parsimony analysis. In fact, an ML analysis of the RAG1 data set under its best-fit model of nucleotide substitution (GTR+I+
) recovered a tree (not shown) that was fully resolved but otherwise topologically consistent with the parsimony tree; notably, the B+I node was also strongly supported (likelihood bootstrap = 97%). Reciprocal Shimodaira-Hasegawa tests indicate that the tree derived from ML analysis of RAG1 is significantly different from that supported by the combined DMP1+IRBP data set (RAG1 on DMP1+IRBP tree:
= 113.7, P < 0.001; DMP1+IRBP on RAG1 tree:
= 141.4, P < 0.001). So strong is RAG1 support for the B+I group that a combined (simultaneous) parsimony analysis of all four data sets (IRBP+DMP1+RAG1+morphology) also recovers this node (Fig. 2). Bootstrap support for B+I, however, drops from 97% (in the RAG1-only analysis) to 78% (in the combined data analysis), presumably because of character conflict between RAG1 and the other datasets. Predictably, partitioned Bremer support (PBS) values are either all positive or only weakly conflicting at most nodes in the combined-data tree, but they are strongly conflicting for the B+I group: this node receives support only from RAG1 (PBSRAG1 = +22.2), whereas the total PBS from the other data sets is strongly negative (PBSIRBP + DMP1 + morphology = –12.2). Of particular interest is the observation that most (97%) of the support for the B+I group comes from RAG1 third codon positions.
|
Base Composition Analyses
The frequency of G+C at third codon positions of RAG1 varies by nearly 20% across these didelphid taxa, whereas GC content at first and second positions of this gene varies by only 4% (Fig. 3). In contrast, IRBP and DMP1 show only slight variation in GC content at all codon positions. Both first + second and third positions of IRBP show about 4% variation in G+C frequency; first + second positions of DMP1 vary by less than 3%, whereas third positions of DMP1 vary by about 8%. Not surprisingly then, neither IRBP nor DMP1 sequences exhibit significant departures from base-compositional homogeneity across taxa at any codon positions (Jansa and Voss, 2000, 2005; Jansa et al., 2006). In contrast, our RAG1 sequences exhibit marked departures from base-compositional homogeneity across taxa. Although this gene showed no significant base-compositional bias at first or second positions for any didelphid taxon, 30 species (of the 41 we sequenced) show a significant departure from base-compositional homogeneity at third positions as assessed by
2 goodness-of-fit tests. Of these, 18 species have GC content at third positions (GC3) equal to or greater than 70% (Table 2). Given the significant base-compositional heterogeneity at third positions of RAG1, we analyzed this partition independently using both MP and ML methods. Trees recovered under each analytical approach (the ML tree is shown in Fig. 4) were topologically identical to the parsimony and likelihood trees previously obtained from the entire gene. As expected, the anomalous B+I group receives very strong bootstrap support (97% ML; 98% MP) from RAG1 third positions.
|
|
|
The 18 species with GC3
70% are all members of clade B or clade I (Fig. 4), suggesting that the anomalous grouping of these taxa is due to shared similarity in base composition. The ID values we computed to quantify base-compositional heterogeneity lend additional credence to this hypothesis. A UPGMA phenogram that clusters taxa based on ID scores reveals three main groupings: (1) taxa with GC3
70%, (2) taxa with GC3 = 61–66%, and (3) taxa with GC3 = 57–59% (Fig. 4, inset). The largest differences in ID scores (at least 8.3 ID units) occur between taxa with GC3
70% and the remaining taxa (with GC3 = 57–66%).
If the anomalous grouping of clades B and I is due to convergence in GC3 content, then we would expect to observe that AT
GC substitutions exceed AT
GC substitutions at the nodes subtending each of the affected clades. To test for this bias, we used parsimony to optimize character-state transitions for RAG1 third positions on a tree derived from analysis of all other data (first and second positions of RAG1+IRBP+DMP1 +morphology). In fact, character-state reconstructions at the nodes subtending clades B and I do show an extraordinary bias in favor of AT
GC substitutions (Table 3). This bias continues, although less strongly, along branches within each of these clades: the ratio of the total number of AT
GC substitutions is on average three times higher than the number of AT
GC substitutions within clades B and I (Table 3). In contrast, exemplar clades that are less GC-rich (i.e., with GC3 in the range of 57% to 66%) have a ratio of AT
GC to AT
GC substitutions that is much closer (0.9) to the one-to-one ratio expected under equilibrium.
|
Interestingly, third positions of RAG1 also appear to be evolving more quickly among taxa that have elevated GC3 content than among other didelphids, a phenomenon that seems to be attributable both to an increased rate of nucleotide substitution and to an increased mutation rate. A higher substitution rate is suggested by significant rate heterogeneity across taxa at RAG1 third positions as assessed by a hierarchical likelihood-ratio test designed to assess the fit of a molecular clock (–lnLGTR + I +
+ clock = 9268.74,
= 547.28, df = 39, P << 0.001), together with a significant positive correlation between pairwise patristic distances (based on ML-estimated branch lengths) and the extent of departure from base-compositional homogeneity (represented by ID values) for RAG1 third positions (Mantel test: 1000 replicates, r = 0.495, P < 0.001). Moreover, a penalized likelihood analysis of molecular evolutionary rate indicates that third positions of RAG1 are evolving much more quickly in the GC3-rich clades (B and I) than they are in any other taxa (Fig. 5). This rate increase in GC3-rich taxa appears to be peculiar to the RAG1 locus because it is not apparent for DMP1 or IRBP. Although DMP1 does exhibit significant among-taxa rate heterogeneity (–lnLGTR + I +
+ clock = 5966.36,
= 117.78, df = 39, P << 0.001), long branches for this data partition are not associated with the same taxa that show long branches for RAG1 third positions. IRBP exhibits no significant rate heterogeneity among didelphids (–lnLGTR + I +
+ clock = 4511.05,
= 53.67, df = 39, p = 0.059).
|
Mutation rates are impossible to measure directly from these data, but rough estimates can be obtained from individual heterozygosity values. Under the standard neutral expectation, heterozygosity scales as the product of effective population size and mutation rate (Kimura, 1983). If effective population size does not differ substantially among these taxa (an assumption that we cannot currently evaluate), then heterozygosity could be a reasonable proxy for mutation rate. Individual heterozygosity values for the RAG1 locus across didelphids range from zero to 0.005 and are significantly positively correlated with GC3 content (r2 = 0.0961, P = 0.049). In contrast, heterozygosity values for the other two loci (IRBP and DMP1) show no correlation with GC3 content (r2 = 0.004, P = 0.696). Therefore, RAG1 also appears to be unique in having an elevated mutation rate for those taxa with high GC3 content.
To assess whether taxa in clade B and clade I are now at base-compositional equilibrium, we used the test described by Eyre-Walker (1994) that compares the relative GC content between pairs of taxa at sites that differ between them. If two sequences are at equilibrium, we expect half of the differing sites to be G or C and half to be A or T, regardless of whether the two lineages show different overall rates of evolution. Departures from this expectation can be assessed using the binomial distribution. We performed this test for six phylogenetically independent pairs, four of which show an elevated GC3 content and two that do not. Neither of the comparisons involving taxa in non–GC-rich clades shows a significant departure from equilibrium substitution rates, whereas all of the comparisons involving GC-rich taxa do (Table 4). Therefore, members of the two clades that show shifts to higher GC content have yet to return to base-compositional equilibrium.
|
Phylogenetic Effects of Correcting for Base-Compositional Heterogeneity
In order to explore the utility of phylogenetic methods designed to correct for base-compositional heterogeneity among taxa, we analyzed third positions of RAG1 using both LogDet distance transformations (Lockhart et al., 1994) and the GG98 maximum likelihood method that accounts for nonhomogeneous patterns of substitution among taxa (Galtier and Gouy, 1998). Because LogDet analysis may be affected by among-site rate heterogeneity (Lockhart et al., 1994), we calculated LogDet distances three different ways: including only parsimony informative sites; including a likelihood estimate for the proportion of invariant sites (pinv = 0.32); and including all sites without correction. Unfortunately, none of these approaches seem to be effective for these data. All analyses recover the anomalous B+I group with very high bootstrap support (Fig. 4). Because the GG98 model allows GC content to vary across all branches in the tree, it is parameter rich and too computationally intensive to perform a ML bootstrap analysis. To improve our confidence that we recovered the best tree while searching under this model, we used two different tree topologies (one based on all data except RAG1; one based on RAG1 only) as starting points for the branch-swapping algorithm. Both searches converged on the same topology, which contained the anomalous B+I group.
| Discussion |
|---|
|
|
|---|
Phylogenetic Implications
Our results clearly indicate that one episode of GC enrichment occurred in the ancestral lineage of clade B (Gracilinanus +Thylamys +Cryptonanus), and that another occurred in the ancestor of clade I (Marmosa + Micoureus). The resulting convergence in base composition is sufficiently strong that phylogenetic analyses of RAG1 sequences, separately or in combination with other data sets, strongly support a group (B+I) that contains these two GC-rich but historically unrelated clades. It is of no small concern that two phylogenetic methods explicitly designed to account for base-compositional heterogeneity among taxa not only failed to recover the correct placement of clades B and I but supported their anomalous grouping with high statistical support. Thus, RAG1 sequence evolution in didelphids provides a particularly clear example of convergent base composition misleading phylogenetic inference.
Didelphid RAG1 evolution is of further interest because it shows that severe base-compositional shifts can occur over comparatively short evolutionary timescales. Recent phylogenetic research indicates that the marsupial crown group is not known from pre-Tertiary fossils (Rougier et al., 1998). Moreover, according to Steiner et al.'s (2005) age estimates of didelphid cladogenetic events (based on relaxed-molecular-clock methods applied to nuclear gene sequences), the base-compositional shifts described in this report probably occurred sometime between 12 and 30 million years ago. By contrast, most previously reported examples of convergent base composition (cited in our Introduction) involve clades that are hundreds of million years old. The relative recency of base-compositional shifts for didelphid RAG1 sequences may explain, in part, why the two GC-rich lineages we identified show no evidence of having reached base-compositional equilibrium.
Several workers have suggested that convergent base composition will only be problematic for phylogenetic inference if taxa differ in GC content by more than about 10% (Eyre-Walker, 1998; Galtier and Gouy, 1995). Others (Conant and Lewis, 2001) have emphasized that additional evolutionary processes, including biased transition/transversion ratios and among-site rate heterogeneity, may interact with base-compositional heterogeneity to lead phylogenetic inference astray. In particular, the inability to correctly infer phylogeny from convergently GC-biased sequences appears to be exacerbated by short internal branches (Conant and Lewis, 2001; Jermiin et al., 2004). For RAG1 third positions, the average difference in GC content between biased and nonbiased clades (
GC = 14%) is above the minimum value considered problematic for phylogenetic inference. After we added 4194 bp from other nuclear genes (IRBP, DMP1, and the remaining RAG1 codon positions), however, the average base-compositional difference between these clades dropped well below this value (
GC = 3%). Nevertheless, we were still unable to recover the correct phylogeny from a combined-data analysis.
If small differences in base composition among taxa can have misleading phylogenetic consequences, then the sensitivity of existing tests for this phenomenon is an important methodological issue. For the concatenated-gene data set, both the contingency test for base-compositional stationarity employed by PAUP* and the ID test (Kumar and Gadagkar, 2001) revealed significant base-compositional heterogeneity across taxa, even though
GC was only 3%. In contrast, the
2 goodness-of-fit test for individual departures from average GC content was unable to identify particular taxa that deviated from the average base composition of these data. However, we note that these
GC estimates are based on the total number of sites sequenced, rather than the subset of those sites that are parsimony informative. When we recalculated average differences in base composition between clades based on informative sites alone, the average
GC for RAG1 third positions between biased and nonbiased clades is 29%. When we performed a similar correction for the concatenated-gene data set, the average GC content between these clades was still high enough to be considered problematic (
GC = 15%). Therefore, we echo other workers (Jermiin et al., 2004; Conant and Lewis, 2001; Lockhart et al., 1994; Collins et al., 2005) in cautioning that it is prudent to test carefully for base-compositional bias prior to any phylogenetic analysis of nucleotide sequence data.
Although RAG1 is nonstationary and phylogenetically misleading at one hierarchical level, this gene provides both increased resolution and support for relationships among taxa within the GC-rich clades (clades B and I; compare Fig. 1A and Fig. 1B). Therefore, we would like to include the relevant phylogenetic signal that this gene contains in a combined-data analysis while mitigating the misleading signal due to base-compositional convergence. One possibility involves explicitly modeling (Galtier and Gouy, 1998; Yang and Roberts, 1995) or correcting for (Lockhart et al., 1994; Lake, 1994) base-compositional heterogeneity across taxa. However, model-based approaches can be parameter-rich and difficult to implement in a combined data analysis; they are also conspicuously ineffective when applied to our nonstationary RAG1 sequences. Short of eliminating the biased gene entirely, different ways of handling such problematic datasets have been proposed (Loomis and Smith, 1990; Hasegawa and Hashimoto, 1993; Kumar and Gadagkar, 2001; Collins et al., 2005; Jeffroy et al., 2006). Among the most tractable are (1) analyzing amino acid sequences, (2) eliminating third positions, or (3) including only third position transversions (e.g., by "RY-coding"). As applied to our data, all three methods effectively eliminate the misleading signal due to base-compositional convergence: the spurious B+I clade was not recovered by any method, and nodes C and H were recovered with comparably high bootstrap support in all three analyses (Fig. 6). However, these procedures do affect the degree of resolution and support for recovered relationships within the GC-rich clades.
|
Evolution of Base Composition
Although RAG1 is a problematic source of data for inferring phylogenetic relationships among didelphids, examining the evolution of this gene on the best estimate of didelphid phylogeny derived from other data provides useful information about the causal mechanisms of base-compositional heterogeneity. RAG1 clearly exhibits very different evolutionary dynamics in two GC-rich lineages than it does in other clades. In particular, the higher rates of substitution and mutation for taxa in clades B and I merit explanation. One possibility invokes recent theory that links recombination rate and the evolution of GC content in mammalian genomes.
During meiotic recombination, different alleles may form heteroduplex DNA that contains T:G and C:A mismatches (Lamb, 1984; Galtier et al., 2001). Theoretically, these mismatches should be repaired correctly to T:A and C:G, respectively. However, mismatch repair in mammals appears to be biased by a DNA repair mechanism that preferentially repairs T:G mismatches to C:G (Brown and Jiricny, 1987). This process has been termed biased gene conversion (BGC) and, if it occurs frequently, could cause regions of high recombination to become increasingly GC-rich (Galtier et al., 2001; Montoya-Burgos et al., 2003; Meunier and Duret, 2004). Mounting evidence strongly favors BGC as an important process driving base-compositional differences across the mammalian genome (Eyre-Walker and Hurst, 2001; Birdsell, 2002; Piganeau et al., 2002; Galtier, 2004; Perry and Ashworth, 1999). Moreover, if sequences are not at base-compositional equilibrium (and these GC-rich didelphid sequences are not; Table 4), then BGC predicts that the rate of silent substituiton (Ks) will be both elevated and GC-biased (Piganeau et al., 2002). Consistent with these predictions, the GC-rich didelphid lineages all exhibit an elevated rate of substitution and a biased direction of substitution from AT
GC. In addition, heterozygosity values for these taxa suggest that the GC-rich lineages also exhibit a relatively high rate of mutation, which is not necessarily a predicted outcome of BGC. Nonetheless, recent studies have suggested that recombination is itself mutagenic (Hellman et al., 2003; Lercher and Hurst, 2002), and it is possible that this aspect of recombination is directly contributing to the elevated mutation rate for these GC-rich taxa.
Recent evidence suggests that recombination hotspots can change genomic location over short evolutionary timescales; indeed, hotspot location varies even within human populations (Coop, 2005; Jeffreys et al., 2005). Given these findings, one possible explanation for our results is that RAG1 independently acquired a recombination hotspot in the ancestral lineages of clades B and I. Although we do not have direct estimates of recombination rates for these taxa, if RAG1 contains a 1-to 2-kb recombination hotspot, then we would expect the GC-rich region in affected taxa to be localized over only part of the sequenced 2.7-kb fragment. A sliding-window analysis reveals that GC-rich and non–GC-rich sequences have a similar base-compositional profile across the first 600 to 700 bp of the RAG1 exon, and that the GC-rich sequences diverge and adopt their characteristic base composition thereafter (Fig. 7). Although this result may be consistent with alternative explanations, including selection maintaining base composition across the 5' region of the gene, it may be evidence of an elevated recombination rate for part of the gene. If so, and if recombination hotspots are as evolutionarily mobile as recent studies suggest, then the base-compositional convergence that plagues our phylogenetic analysis of the RAG1 locus may be more widespread across nuclear loci than previously anticipated.
|
|
| Acknowledgments |
|---|
|
|
|---|
We particularly thank Keith Barker and George Barrowclough for helpful discussions and comments. As ever, we are grateful to the curators and collections support personnel of the many museums that provided tissue samples for this work. In particular, we thank Phil Myers and Steve Hinshaw (UMMZ), Bruce Patterson and Bill Stanley (FMNH), Mark Engstrom and Burton Lim (ROM), Jim Patton (MVZ), and Robert Baker (TTU). Andrew Simons, Robert Zink, Mark Hafner, Roderic Page, Peter Foster, and two anonymous reviewers provided helpful comments on an earlier draft. The University of Minnesota Supercomputing Institute provided valuable computing time and resources. This study was supported by funds from the National Science Foundation (DEB-0211952) and the University of Minnesota.
| REFERENCES |
|---|
|
|
|---|
-
Baker R. H., DeSalle R. Multiple sources of character information and the phylogeny of Hawaiian drosophilids. Syst. Biol. (1997) 46:654–673.
Barker F. K., Barrowclough G. F., Groth J. G. A phylogenetic hypothesis for passerine birds: Taxonomic and biogeographic implications of an analysis of nuclear DNA sequence data. Proc. R. Soc. Lond. B (2002) 269:295–308.[Medline]
Barrowclough G. F., Groth J. G., Mertz L. A. The RAG-1 exon in the avian order Caprimulgiformes: Phylogeny, heterozygosity, and base composition. Mol. Phylog. and Evol. (2006) 41:238–248.[CrossRef]
Birdsell J. A. Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Mol. Biol. Evol. (2002) 19:1181–1197.
Bremer K. The limits of amino-acid sequence data in angiosperm phylogenetic reconstruction. Evolution (1988) 42:795–803.[CrossRef][Web of Science]
Bremer K. Branch support and tree stability. Cladistics (1994) 10:295–304.[CrossRef][Web of Science]
Brown T. C., Jiricny J. A specific mismatch repair event protects mammalian cells from loss of 5-methylcytosine. Cell (1987) 50:945–950.[CrossRef][Web of Science][Medline]
Chang B. S. W., Campbell D. L. Bias in phylogenetic reconstruction of vertebrate rhodopsin sequences. Mol. Biol. Evol. (2000) 17:1220–1231.
Collins T. M., Fedrigo O., Naylor G. P. Choosing the best genes for the job: The case for stationary genes in genome-scale phylogenetics. Syst. Biol. (2005) 54:493–500.
Conant G. C., Lewis P. O. Effects of nucleotide composition bias on the success of the parsimony criterion in phylogenetic inference. Mol. Biol. Evol. (2001) 18:1024–1033.
Coop G. Can a genome change its (hot)spots? Trends Ecol. Evol. (2005) 20:643–645.
Duret L., Hurst L. D. The elevated GC content at exonic third sites is not evidence against neutralist models of isochore evolution. Mol. Biol. Evol. (2001) 18:757–762.
Eyre-Walker A. DNA mismatch repair and synonymous codon evolution in mammals. Mol. Biol. Evol. (1994) 11:88–98.[Abstract]
Eyre-Walker A. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. (1998) 47:686–690.[CrossRef][Web of Science][Medline]
Eyre-Walker A., Hurst L. D. The evolution of isochores. Nat Revi: Genet. (2001) 2:549–555.[CrossRef]
Felsenstein J. Confidence limits on phylogenies: An approach using the bootstrap. Evolution (1985) 39:783–791.[CrossRef][Web of Science]
Foster P. G., Hickey D. A. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. (1999) 48:284–290.[CrossRef][Web of Science][Medline]
Galtier N. Recombination, GC-content, and the human pseudoautosomal boundary paradox. Trends Genet. (2004) 20:347–349.[CrossRef][Web of Science][Medline]
Galtier N., Gouy M. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. of Sci. USA (1995) 92:11317–11321.
Galtier N., Gouy M. Inferring pattern and process: Maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. (1998) 15:871–879.[Abstract]
Galtier N., Piganeau G., Mouchiroud D., Duret L. GC-content evolution in mammalian genomes: The biased gene conversion hypothesis. Genetics (2001) 159:907–911.
Gardner A. L. Order Didephimorphia. In: Mammal species of the world: A taxonomic and geographic reference—Wilson D. E., Reeder D. M., eds. (2005) Baltimore, Maryland: Johns Hopkins University Press. 3–22.
Groth J. G., Barrowclough G. F. Basal divergences in birds and the phylogenetic utility of the nuclear RAG-1 gene. Mol. Phylogenet. Evol. (1999) 12:115–123.[CrossRef][Web of Science][Medline]
Hasegawa M., Hashimoto T. Ribosomal RNA trees misleading? Nature (1993) 361:23.[Medline]
Hellman I., Ebersberger I., Ptak S. E., Pääbo S., Przeworski M. A neutral explanation for the correlation of diversity with recombination rates in humans. Am. J. Hum. Genet. (2003) 72:1527–1535.[CrossRef][Web of Science][Medline]
Jansa S. A., Forsman J. F., Voss R. S. Different patterns of selection on the nuclear genes IRBP and DMP-1 affect the efficiency but not the outcome of phylogeny estimation for didelphid marsupials. Mol. Phylogen. Evol. (2006) 38:363–380.[CrossRef][Web of Science][Medline]
Jansa S. A., Voss R. S. Phylogenetic studies on didelphid marsupials I. Introduction and preliminary results from nuclear IRBP gene sequences. J. Mammal. Evol. (2000) 7:43–77.
Jansa S. A., Voss R. S. Phylogenetic relationships of the marsupial genus Hyladelphys based on nuclear gene sequences and morphology. J. Mammal. (2005) 86:853–865.[CrossRef][Web of Science]
Jeffreys A. J., May C. A. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. (2004) 2:151–156.
Jeffroy O., Brinkmann H., Delsuc F., Philippe H. Phylogenomics: The beginning of incongruence? Trends Genet. (2006) 22:225–231.[CrossRef][Web of Science][Medline]
Jermiin L. S., Ho S. Y. W., Ababneh F., Robinson J., Larkum A. W. D. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. (2004) 53:638–643.
Kimura M. The neutral theory of molecular evolution (1983) Cambridge, UK: Cambridge University Press.
Kumar S., Gadagkar S. R. Disparity index: A simple statistic to measure and test the homogeneity of substitution patterns between molecular sequences. Genetics (2001) 158:1321–1327.
Kumar S., Tamura K., Nei M. MEGA3: Integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief. Bioinformatics (2004) 5:150–163.
Lake J. A. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proc. Nat. Acad. Sci. USA (1994) 91:1455–1459.
Lamb B. C. The properties of meiotic gene conversion important in its effects of evolution. Heredity (1984) 53:113–138.[CrossRef][Web of Science][Medline]
Lercher M. J., Hurst L. D. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. (2002) 18:337–340.[CrossRef][Web of Science][Medline]
Lockhart P. J., Howe C. J., Bryant D. A., Beanland T. J., Larkum A. W. D. Substitutional bias confounds inference of cyanel origins from sequence data. J. Molec. Evol. (1992) 34:153–162.[Medline]
Lockhart P. J., Steel M. A., Hendy M. A., Penny D. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. (1994) 12:605–612.
Loomis W. F., Smith D. W. Molecular phylogeny of Dictyostelium discoideum by protein sequence comparison. Proc. Nat. Acad. Sci. USA (1990) 87:9093–9097.
Mahon A. R., Carpenter K. E. A phylogeny of putative perciform suborders based on the nuclear RAG1 gene. Integr. Comp. Biol. (2004) 44:723–723.
Meunier J., Duret L. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. (2004) 21:984–990.
Montoya-Burgos J. I., Boursot P., Galtier N. Recombination explains isochores in mammalian genomes. Trends Genet. (2003) 19:128–130.[CrossRef][Web of Science][Medline]
Mooers A. O., Holmes E. C. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. (2000) 15:365–369.[CrossRef][Medline]
Nowak R. M. Walker's mammals of the world (1999) 6th edition. Baltimore, Maryland: Johns Hopkins University Press.
Perry J., Ashworth A. Evolutionary rate of a gene affected by chromosomal position. Curr. Biol. (1999) 9:987–989.[CrossRef][Web of Science][Medline]
Piganeau G., Mouchiroud D., Duret L., Gautier C. Expected relationship between the silent substitution rate and the GC content: Implications for the evolution of isochores. J. Mol. Evol. (2002) 54:129–133.[CrossRef][Web of Science][Medline]
Posada D., Crandall K. A. ModelTest: Testing the model of DNA substitution. Bioinformatics (1998) 14:817–818.
Rougier G. W., Wible J. R., Novacek M. J. Implications of Deltatheridium specimens for early marsupial history. Nature (1998) 396:459–463.[CrossRef]
San Mauro D., Vences M., Alcobendas M., Zardoya R., Meyer A. Initial diversification of living amphibians predated the breakup of Pangaea. Am. Nat. (2005) 165:590–599.[CrossRef][Web of Science][Medline]
Shimodaira H., Hasegawa M. Multiple comparisons of log-liklehoods with applications to phylogenetic inference. Mol. Biol. Evol. (1999) 18:1114–1116.
Sorenson M. D. TreeRot (1999) Boston, Massachusetts: Boston University.
Steiner C., Tilak M., Douzery E. J. P., Catzeflis F. M. New DNA data from a transthyretin nuclear intron suggest an Oligocene to Miocene diversification of living South America opossums (Marsupialia: Didelphidae). Mol. Phylogenet. Evol. (2005) 35:363–379.[CrossRef][Web of Science][Medline]
Swofford D. L. PAUP*. Phylogenetic analysis using parsimony (*and other methods), version 4 (2003) Sunderland, Massachusetts: Sinauer Associates.
Tarrìo R., Rodrìguez-Trelles F., Ayala F. J. Tree rooting with outgroups when they differ in their nucleotide composition from the ingroup: The Drosophilasaltans willistoni groups, a case study. Mol. Phylogenet. Evol. (2000) 16:344–349.[CrossRef][Web of Science][Medline]
Tarrìo R., Rodrìguez-Trelles F., Ayala F. J. Shared nucleotide composition biases among species and their impact on phylogenetic reconstructions of the Drosophilidae. Mol. Biol. Evol. (2001) 18:1464–1473.
Voss R. S., Jansa S. A. Phylogenetic studies on didelphid marsupials II. Nonmolecular data and new IRBP sequences: Separate and combined analyses of didelphine relationships with denser taxon sampling. Bull. Am. Mus. Nat. Hist. (2003) 276:1–82.[CrossRef]
Voss R. S., Lunde D. P., Jansa S. A. On the contents of Gracilinanus Gardner and Creighton, 1989, with the description of a previously unrecognized clade of small didelphid marsupials. Am. Mus. Novitates (2005) 3482:1–34.[CrossRef]
Waddell P. J., Shelley S. Evaluating placental inter-ordinal phylogenies with novel sequences including RAG1, gamma-fibrinogen, ND6, and mt-tRNA, plus MCMC-driven nucleotide, amino acid, and codon models. Mol. Phylogenet. Evol. (2003) 28:197–224.[CrossRef][Web of Science][Medline]
This article has been cited by other articles:
![]() |
N. C. Sheffield, H. Song, S. L. Cameron, and M. F. Whiting Nonstationary Evolution and Compositional Heterogeneity in Beetle Mitochondrial Phylogenomics Syst Biol, August 1, 2009; 58(4): 381 - 394. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. V. Edwards Natural selection and phylogenetic analysis PNAS, June 2, 2009; 106(22): 8799 - 8800. [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

70%; grey, 70% < BS < 90%; black, BS 







