© 2008 Society of Systematic Biologists
The Impact of Reticulate Evolution on Genome Phylogeny
Edited by Olaf Bininda-Emonds
1 Faculty of Computer Science, Dalhousie University, and Institute for Molecular Bioscience/ARC Centre for Bioinformatics Brisbane, Australia; E-mail: beiko{at}cs.dal.ca
2 Genome Atlantic, Department of Biochemistry & Molecular Biology, Dalhousie University Halifax, Nova Scotia, Canada
Abstract
Genome phylogenies are used to build tree-like representations of evolutionary relationships among genomes. However, in condensing the phylogenetic signals within a set of genomes down to a single tree, these methods generally do not explicitly take into account discordant signals arising due to lateral genetic transfer. Because conflicting vertical and horizontal signals can produce compromise trees that do not reflect either type of history, it is essential to understand the sensitivity of inferred genome phylogenies to these confounding effects. Using replicated simulations of genome evolution, we show that different scenarios of lateral genetic transfer have significant impacts on the ability to recover the "true" tree of genomes, even when corrections for phylogenetically discordant signals are used.
Keywords: Evolutionary simulation; genome phylogeny; lateral genetic transfer
Received March 9, 2008; Revised May 29, 2008; Accepted August 25, 2008
An important motivator in many phylogenetic analyses is that the branching relationships inferred from a set of orthologous sequences may serve as a direct indicator of organismal phylogeny. The best example of this is found in the use of 16S and 18S ribosomal DNA sequences for phylogenetic classification of organisms (Woese et al., 1990). The recognition that a single set of putatively orthologous sequences may not yield an accurate depiction of organismal descent due to violations of the phylogenetic model used, insufficient phylogenetic signal, cryptic paralogy, or lateral genetic transfer (LGT) led to a partial abandonment of single-gene methods in prokaryotes. In their place emerged a plethora of methods that depend on much greater data availability: concatenated sequence phylogenies (Baldauf et al., 2000; Brochier et al., 2004), supertrees and networks constructed from many individual phylogenetic trees (Daubin et al., 2001; Creevey et al., 2004; Beiko et al., 2005; Holland et al., 2005), and whole-genome methods that typically simplify genetic data to yield an easily computed summary of relationships between genomes.
Many genome properties have been used as basic characters for the inference of genome-genome relationships, including gene content (Snel et al., 1999; Lake and Rivera, 2004), gene order (Sankoff et al., 1992; Belda et al., 2005), and properties of the distribution of sequence similarities between genomes (Clarke et al., 2002; Auch et al., 2006). All of the above analyses assume that convergent evolution of the character under consideration is unlikely, when compared with other genome properties such as (for example) G+C content or synonymous codon usage. The clearest violation of this non-convergence assumption is in the reduction of parasitic genomes: because genomes tend to lose many of the same genes when undergoing genome reduction, using gene absence as a parsimoniously informative character leads to artifactual grouping of small genomes in gene content trees (Wolf et al., 2001; Kunin et al., 2005). Other, more subtle biases may influence these methods as well: if distantly related organisms have similar nucleotide or amino acid usage due to, e.g., similar mutational biases, nutrient limitations, or environmental conditions, then some of their genes may appear less distant from one another than the true divergence time would suggest (Weisburg et al., 1989). LGT, which can introduce sequences into a genome from organisms with any degree of relatedness, yields an apparent convergence of gene content, similarity, or sometimes order, but in fact violates the fundamental assumption in phylogenetic methods that evolution is tree-like.
An understanding of the extent and impact of LGT is crucial to interpreting the relationships shown in a genome tree. One approach is to identify sequences that are discordant using a surrogate method and either downweight their contribution to the genome tree or eliminate them entirely from consideration (Clarke et al., 2002; Dutilh et al., 2004; Gophna et al., 2005). A limitation of this approach is that different methods for identifying conflicting genes tend to identify different sets of genes (Ragan, 2001, 2006), so the choice of homology criterion and filtering method can have a substantial impact on the inferred genome history. The fundamental problem is that without an accurate estimate of the extent and source of LGT within a given data set, it is very difficult to assess the impact of LGT on the final genome tree. In many published analyses, there is strong reason to suspect that LGT has influenced the position of certain lineages. In some cases, taxa that are thought to participate in frequent transfer are drawn towards one another in the tree. This effect is suggested in the case of the archaeal genus Thermoplasma: concatenated informational gene phylogenies suggest that it is secondarily non-methanogenic (Brochier et al., 2005), yet it appears as an early-branching euryarchaeal or archaeal lineage in many published studies, including Wolf et al. (2001), Beiko et al. (2005), and Gophna et al. (2005). There is strong evidence for extensive LGT from the thermoacidophilic crenarchaeon Sulfolobus to Thermoplasma, which may produce a compromise in the positioning of Thermoplasma in aggregated trees. In some cases, transfer partners appear as sisters in genome trees; for instance, Arabidopsis appears as a sister taxon to the cyanobacteria in genome trees of metabolic genes (Charlebois et al., 2004).
Although simulation has been used extensively in the investigation and validation of methods in molecular evolution, such techniques are only now being applied to the study of genome evolution (Zhaxybayeva et al., 2006; Galtier, 2007). Part of the reason for this is the relative novelty of genome sequences, but another barrier to meaningful genome simulation has been the difficulty in merging traditional models of sequence change with evolutionary scenarios and interactions within and among genomes. EvolSimulator (Beiko and Charlebois, 2007) has been developed to simulate the evolutionary phenomena most relevant to the study of LGT, including genome-specific mutational and selective regimes, gene content evolution via gene duplications, losses and LGT, and organismal evolution with speciation, extinction, and competition for simulated niches and habitats. Here we use EvolSimulator to generate populations of genomes under different scenarios and frequencies of LGT, to allow a precise delineation of the extent to which different modes of LGT can impact inferred genome histories. The weighting schemes introduced by Gophna et al. (2005), based on the observed phylogenetic concordance or discordance of proteins, can have a substantial impact on the genome tree, and here we assess the effectiveness of these schemes in improving the tree of genomes that is recovered.
Methods
Evolutionary Simulations
EvolSimulator version 2.0.4 (Beiko and Charlebois, 2007) was used to evolve populations of genomes with a consistent set of constraints on genomic properties and sequence substitution but different types of LGT regime. Each simulation began with a single, ancestral genome having 1000 unrelated genes (240 to 1500 nt in size), from which a population of genomes would evolve over 5000 iterations. Point mutations were assessed against the standard genetic code as described in Beiko and Charlebois (2007), with amino acid acceptance probabilities proportional to the WAG matrix (Whelan and Goldman, 2000). Insertion and deletion events were not simulated. Genomes were permitted to drift in size by loss and gain (by duplication, and if prescribed, by lateral acquisition) of genes, to as few as 500 genes or as many as 3500. Speciation and extinction events were balanced such that the simulation maintained between 50 and 60 genomes at any given time, following an initial growth phase.
EvolSimulator constructs a user-defined number of "niches" within which genomes reside, the occupation of which is competitively determined by relative gene complements. Each niche has a finite number of spaces that are identical in terms of required genes, potentially limiting the number of genomes that can exploit that niche. Habitats comprise one or more niches and impose specific gene requirements on a resident genome. In these simulations, we distributed 1000 spaces evenly among 100 niches and these 100 niches among 10 habitats. For a genome to spend time in a niche, it must possess all of the genes required by a niche and by its enclosing habitat, such necessities being randomly chosen by EvolSimulator at the start of the run. In addition to this qualitative requirement, the quantitative usefulness of genes within the current niche regulates their propensity to be retained or lost, as well as the amount of time a genome may spend within that niche. If a speciation event creates two genomes, neither of which can migrate to a vacant niche, the niche that was occupied by the ancestral genome will be overcrowded by the descendant lineages until a migration or extinction event removes one genome from this niche. One can, though here we did not, bias speciation/extinction probabilities according to overall genomic fitness.
We explored five independent LGT scenarios in this study: (a) no LGT, (b) random LGT, (c) relations-biased LGT (occurring more often with closer relatives), (d) gene content-biased LGT (occurring more often between genomes with more-similar ortholog constitution), and (e) habitat-restricted LGT (occurring only between genomes concurrently residing in the same habitat). A successful transfer event according to the biasing criterion always led to uptake of the transferred gene; if an ortholog was already present in the recipient genome, it was replaced with the incoming gene. Although we did not explore blends of these scenarios here, we did explore three rates of LGT: in independent runs of scenarios (b) to (e), the mean number of attempted events E was set to 10, 50, and 250 nominal events per iteration. The single run (a), plus three runs each of (b) to (e) with variable E, comprised a complete set of 13 simulation runs. By using the same random number seed for each run in a set, we were able to exactly replicate the speciation and extinction history (i.e., each run within a set had the same reference "organismal" tree) and lineage-specific mutation biases, thus ensuring that differences among the 13 runs in a set would be due only to the LGT model that was used.
Five replicate sets of the 13 scenarios were executed, with each set of replicates employing its own seed for pseudorandom number generation, for a total of 65 simulation runs. For a complete summary of all parameters used in the simulation, please refer to the EvolSimulator configuration file in the Supplemental Material (http://www.systematicbiology.org).
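The run design described above (one LGT-free run plus four LGT scenarios at three rates, repeated over five seeds) can be enumerated with a short sketch. This is only an illustration of the bookkeeping, not EvolSimulator configuration syntax: the function names, the dictionary layout, and the seed values are all hypothetical.

```python
from itertools import product

# Scenario labels follow the text; E is the mean number of attempted
# LGT events per iteration.
LGT_SCENARIOS = ["random", "relations-biased", "content-biased", "habitat-restricted"]
LGT_RATES = [10, 50, 250]


def build_run_set(seed):
    """Return the 13 run descriptions that share one random seed.

    Sharing the seed is what keeps the speciation/extinction history
    identical across the runs of a set.
    """
    runs = [{"scenario": "none", "E": 0, "seed": seed}]
    runs += [
        {"scenario": scenario, "E": rate, "seed": seed}
        for scenario, rate in product(LGT_SCENARIOS, LGT_RATES)
    ]
    return runs


def build_design(seeds):
    """Return all runs for all replicate sets (one set per seed)."""
    return [run for seed in seeds for run in build_run_set(seed)]


# Seed values are placeholders; the actual seeds are not given in the text.
design = build_design(seeds=[1, 2, 3, 4, 5])
print(len(build_run_set(1)), len(design))  # 13 65
```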
Supplemental Figure 1 (http://www.systematicbiology.org) illustrates the complete set of speciation and extinction events for the entire 5000 iterations of one of our replicates. Extant genomes and closely related extinct lineages have been assigned to eight monophyletic groups, each labeled with a Greek letter; the precise delineation of these groups is arbitrary, but they all diverged in the earliest 10% of simulated iterations. Colors have been assigned to phylum-level groups in order to highlight cases of invasive intermingling in the inferred phylogenies shown below.
Inference of Genome-Scale Phylogeny
Normalized BLASTP-based phylogeny was performed in a manner similar to Clarke et al. (2002), with the distance between every pair of genomes equal to 1.0 minus the mean normalized BLASTP 2.2.2 (Altschul et al., 1997) score of all pairwise reciprocal best matches (RBMs) between the two genomes. Only RBMs with BLASTP e-values of 1.0 × 10⁻⁵ or less were used to build this matrix. The minimum evolution algorithm implemented in the November 28, 2003, release of FastME (Desper and Gascuel, 2002) was used to build a phylogenetic tree from the matrix of all genome pairwise distances. Statistical support for each distance tree was assessed by resampling the set of RBMs with replacement for each pair of genomes to generate bootstrapped distance matrices, with the corresponding trees constructed using FastME. Resampling was performed 100 times on each data set to yield support values for each bipartition in a given genome tree.
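A minimal sketch of the distance and bootstrap steps for a single genome pair is given below. It assumes each RBM has already been reduced to a (normalized score, e-value) tuple with scores in the range 0 to 1; that data layout, the function names, and the example values are assumptions made for illustration, and tree building with FastME is not shown.

```python
import random

E_VALUE_CUTOFF = 1.0e-5  # RBMs with larger e-values are ignored


def genome_distance(scores):
    """Distance for one genome pair: 1.0 minus the mean normalized score."""
    if not scores:
        raise ValueError("no reciprocal best matches for this genome pair")
    return 1.0 - sum(scores) / len(scores)


def usable_scores(rbms):
    """Keep the normalized scores of RBMs that pass the e-value cutoff."""
    return [score for score, e_value in rbms if e_value <= E_VALUE_CUTOFF]


def bootstrap_distances(scores, replicates=100, rng=None):
    """Resample the RBM scores with replacement and recompute the distance."""
    rng = rng or random.Random()
    n = len(scores)
    return [
        genome_distance([scores[rng.randrange(n)] for _ in range(n)])
        for _ in range(replicates)
    ]


# Made-up RBMs for one genome pair: (normalized score, e-value).
rbms = [(0.42, 1e-30), (0.35, 1e-12), (0.51, 1e-40),
        (0.28, 1e-6), (0.44, 1e-20), (0.10, 0.5)]
scores = usable_scores(rbms)           # the last RBM fails the cutoff
distance = genome_distance(scores)     # 1 - 0.40 = 0.60
boots = bootstrap_distances(scores, replicates=100, rng=random.Random(0))
```

Applying `bootstrap_distances` to every genome pair gives one distance matrix per replicate, and each of those matrices would then be passed to the tree-building step.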
Phylogenetically discordant sequences (PDS) disagree with the majority phylogenetic signal and are frequently observed in real data due both to violations of phylogenetic assumptions and to bona fide instances of LGT. To limit the effect of phylogenetic discordance on inferred genome trees, we applied the PDS procedure described in Clarke et al. (2002) to the set of genomes obtained from each run in replicate set 1 (other replicates were not so examined). Each individual protein has an associated PDS score, which is calculated by comparing the ranking of its similarity to putative orthologs in a list of other genomes (u-values) versus a ranking based on the median of all pairs of putative orthologs from every other genome (w-values). The Spearman rank correlation thus obtained is compared to a large number of correlations obtained by randomizing the comparisons, to obtain a P-value. The mean of the PDS P-values associated with a given pair of RBM proteins could then be used to weight the contribution of that pair to the overall distance measure between the two relevant genomes (Gophna et al., 2005). The unweighted distance D between a pair of genomes A and B with a total of n RBMs is given by the following formula:
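$$D_{AB} = 1 - \frac{1}{n}\sum_{i=1}^{n} S_i \qquad (1)$$

where $S_i$ is the normalized BLASTP score of the $i$th RBM between A and B.

[Equations (2) and (3) are not reproduced here.]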
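The PDS P-value described above rests on a Spearman rank correlation that is compared against correlations from randomized data. The sketch below shows one way such a permutation test can be written. It is not the Clarke et al. (2002) implementation: the shuffling scheme, the one-sided direction of the test, and all function names are assumptions, and the input vectors are made up.

```python
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(u, w, permutations=1000, rng=None):
    """Fraction of shuffles whose correlation is at least the observed one."""
    rng = rng or random.Random()
    observed = spearman(u, w)
    shuffled = list(u)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if spearman(shuffled, w) >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)


# Made-up similarity values for one protein (u) and genome-wide medians (w).
w_values = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
p_concordant = permutation_p_value([0.95, 0.85, 0.7, 0.65, 0.5, 0.3], w_values,
                                   rng=random.Random(1))
p_discordant = permutation_p_value([0.3, 0.5, 0.65, 0.7, 0.85, 0.95], w_values,
                                   rng=random.Random(1))
```

Under this convention a protein whose ranking agrees with the genome-wide ranking receives a small P-value, and a protein with a reversed ranking receives a P-value of 1.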
Quantifying Differences between Inferred Genome Trees
Each inferred genome tree was compared to the true "organismal" history to assess the accuracy of the inferred relationships. A range of bootstrap support thresholds between 0.50 and 1.00 was used as minimal criteria for strongly supported relationships in the inferred genome trees: in several analyses, conservative (0.90) and liberal (0.70) thresholds were contrasted. Because the organismal reference tree is completely resolved, strongly supported bipartitions present in a given genome tree must either be congruent or incongruent with the reference tree. Disagreements between trees were expressed in terms of the total count of concordant and discordant bipartitions and in terms of the proportion of all resolved bipartitions that were concordant.
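Because the reference tree is fully resolved, a supported bipartition is concordant exactly when it also occurs in the reference tree. The sketch below counts concordant and discordant bipartitions at a given bootstrap threshold. It assumes bipartitions are supplied as sets of taxon names with trivial (single-taxon) splits already excluded; the function names and the five-taxon example are hypothetical.

```python
def canonical(split, taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    side = frozenset(split)
    reference_taxon = min(taxa)
    return frozenset(taxa) - side if reference_taxon in side else side


def score_tree(inferred, reference, taxa, threshold):
    """Count concordant/discordant bipartitions supported at >= threshold.

    inferred  : list of (bipartition, bootstrap proportion) pairs
    reference : list of bipartitions of the fully resolved true tree
    """
    reference_set = {canonical(split, taxa) for split in reference}
    concordant = discordant = 0
    for split, support in inferred:
        if support < threshold:
            continue  # unresolved at this threshold
        if canonical(split, taxa) in reference_set:
            concordant += 1
        else:
            discordant += 1
    resolved = concordant + discordant
    proportion = concordant / resolved if resolved else None
    return concordant, discordant, proportion


taxa = {"A", "B", "C", "D", "E"}
reference = [{"A", "B"}, {"D", "E"}]               # true tree ((A,B),C,(D,E))
inferred = [({"A", "B"}, 0.95), ({"C", "D"}, 0.75)]  # inferred ((A,B),E,(C,D))

print(score_tree(inferred, reference, taxa, threshold=0.70))  # (1, 1, 0.5)
print(score_tree(inferred, reference, taxa, threshold=0.90))  # (1, 0, 1.0)
```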
EEEP (Beiko and Hamilton, 2006) was used to recover the distance between the organismal reference and each inferred genome tree in turn. The distance used characterizes the number of subtree prune-and-regraft (SPR; Swofford and Olsen, 1990) operations that need to be performed on the organismal reference tree, to obtain the inferred genome tree. Each SPR operation in an edit path implies a donor/recipient pairing, and in the context of single-molecule phylogeny identifies a putative transfer event from one branch to another. SPR events from genome trees may identify major highways of gene sharing, either directly if a given taxon or clade is paired with a major transfer partner or indirectly if the position of a taxon in the tree is intermediate between its transfer partners and its correct position in the organismal tree. It is therefore useful to characterize incongruence both quantitatively in terms of edit distance and qualitatively in terms of the nature of the proposed discordant relationships, given complete knowledge of the "true" trees and type of LGT regime that was simulated.
Results
Species Tree, Genome Divergence, and Gene Exchange
Figure 1 shows the phylogenetic relationships between extant organisms at the end of the replicate 1 simulations. Replicate 1 had the largest population of surviving genomes at the end of the simulation, with 56 extant lineages. The other four replicates had a minimum of 48 and a maximum of 54 genomes at iteration 5000. Each replicate had an initial phase of rapid speciation to reach the target population size of 50 genomes, but following this phase the extant genomes could be separated into lineages supported basally by relatively long branches, appearing as logically distinct phyla. Protein sequences that diverged at the beginning of the simulation were still recognizably homologous.
The relationship between genome divergence time and mean normalized BLASTP distance for each pair of genomes is shown in Figure 2. The mean normalized BLASTP distance increases quickly in the first 250 iterations after divergence and then increases more gradually to the maximum number of iterations, albeit with high variance. Distances between pairs of genomes whose last common ancestor occurred near the beginning of the simulation ranged between approximately 0.6 and 0.7. Phylum-level divergence between real pairs of genomes is approximately 0.7 to 0.75 from Clarke et al. (2002), so the maximum level of divergence among genomes in this analysis is roughly equivalent to that seen among bacterial phyla.
When the mean number of LGT events per iteration E was set to 250, the rate of LGT was sufficiently high that every gene created at the beginning of the simulation was transferred at least once in its history. In the random LGT scenario with E = 250, the distribution of historical transfers for a sample of 10% of the genes from genome 314 was examined. At the end of the simulation (iteration 5000), each gene had been transferred an average of 70 times during the course of its simulated evolution. No gene had been transferred fewer than 31 times, and the maximum number of transfers in the history of any given gene was 103.
Genome Trees under Different LGT Scenarios
BLASTP-based genome trees are strictly bifurcating, with associated bootstrap proportions (BP) reflecting the frequency that a given bipartition was observed within the set of bootstrap replicate trees. Figure 3 shows, for a series of thresholds, the proportion of bipartitions supported at or above that threshold that are either concordant or discordant with the true genome tree—i.e., that either support or conflict with relationships in the true tree (Wilkinson et al., 2005)—recovered from the five runs (one from each replicate set) that were performed without LGT. In each of the five cases, there was a BP threshold at and above which no discordant bipartitions were supported in the inferred genome tree. This minimum threshold ranged from 0.65 in replicate 2 to 0.90 in replicate 5, with discordance in the latter case due to a misplaced deep branch with bootstrap support of exactly 0.85. High BP thresholds yielded exclusively concordant relationships but reduced the number of resolved bipartitions, with fewer than 40% of all bipartitions resolved at a BP threshold of 1.00 for each of the five trees. Lowering the BP threshold from 1.00 to 0.50 increased the number of resolved bipartitions by approximately a factor of two, at the expense of accepting some discordant relationships: across the five replicates, between 3% and 16% of the resolved relationships at a BP threshold of 0.50 were not consistent with the original genome tree.
The genome tree inferred from the LGT-free simulation (E = 0) in replicate 1 is shown in Figure 4. The rapid increase in normalized BLASTP distances in the period immediately following genome divergence (shown in Fig. 2) is evident here in the relatively long branches separating recently diverged taxa. Nonetheless, seven of the eight main monophyletic groupings (β among them) were correctly reconstructed. Although several of these groups were reconstructed with bootstrap support ≥ 90%, there was some difficulty in recovering the deepest branches: the deepest-branching genome of one group was separated from the other two genomes in that group in spite of a relatively long supporting internal branch in the true tree, and bootstrap support was low for two other groups, likely reflecting the frequent intrusion of genome 205 into a group to which it does not belong. Although genome 205 does not branch within that group in the tree shown in Figure 4, a tree computed from the same starting distance matrix but using the Fitch-Margoliash method (Fitch and Margoliash, 1967) displayed this intermingling of phyla. The three groups that were not recovered or recovered with weak support were also the three earliest groups to diverge in Figure 1, suggesting that groups of this age or older may not be recoverable. Within some groups there were discrepancies in the recovered branching order; however, these inconsistencies had low associated bootstrap values. The branching order of major groups was also unreliable; again, incorrect relationships were associated with bootstrap values less than 70%.
Each of the 13 simulated data sets from replicate 1 was used to construct a bootstrapped genome phylogeny. Table 1 shows the degree of strongly supported concordance and discordance that was found for each reconstructed tree relative to the original genome history. As described above, the tree of genomes simulated without LGT had no discordant bipartitions with bootstrap support of 70% or greater, although only 28 of a possible 51 internal bipartitions had a bootstrap support at least this high, and only 25 bipartitions were supported at a bootstrap threshold of 90%. Among runs where LGT occurred more often between closely related genomes (relations-biased), relatively low rates of LGT yielded resolution and concordance values similar to those observed in the absence of LGT. However, the degree of resolution actually increased with increasing rates of LGT. When the mean rate of LGT E was 250 nominal events per iteration, 40 bipartitions had bootstrap support of 70% or greater; 5 of these bipartitions were incongruent with the true tree. An increase in the number of resolved bipartitions was also observed at a bootstrap threshold of 90%, with the number of discordant bipartitions increasing from 1 to 2. Consequently, the gain in resolution appears to reinforce the simulated genome phylogeny; even if the histories of shared genes do not exactly match the reference tree, they can still aid in its recovery.
In all of the other three types of LGT scenario that were simulated, there was a decrease in the number of concordant nodes between E = 10 and E = 250 attempted transfers per iteration, although in some cases there were more concordant nodes at E = 50 than at E = 10. Under content-biased and random LGT in this replicate, the number of discordant bipartitions dropped to zero as E increased to 250, so the 20 strongly supported bipartitions that remained in each of these simulations were all in agreement with the true tree. However, a low rate of habitat-biased LGT was sufficient to introduce discordant relationships into the inferred tree; with E = 10 only 31 out of 38 bipartitions with BP ≥ 0.7 were concordant with the true tree (Fig. 5). And unlike the trees built from content-biased or random LGT, the proportion of strongly supported and discordant bipartitions increased with increasing E.
Nonparametric Kruskal-Wallis tests were performed across all five replicates (summarized in Fig. 6) to assess the significance of differences in percent concordance distribution among LGT scenarios. These tests were applied independently to a total of six combinations of E (10, 50, and 250) and bootstrap threshold (70% in Fig. 6a and 90% in Fig. 6b). For E = 10, the difference in distribution of percentage concordance was significant (P = 0.033, alpha threshold = 0.05) at a BP threshold of 70% but not significant (P = 0.73) at a BP threshold of 90%. With E = 50, P-values of 0.015 and 0.005 were obtained at the two BP thresholds, and P < 0.005 was obtained when E = 250. Pairwise post hoc Nemenyi tests (Hollander and Wolfe, 1999) showed that only habitat-biased LGT produced trees that were significantly less concordant than any of the other groups. In each of the four sets of tests with E equal to 50 or 250 and a BP threshold of 0.7 or 0.9, the distribution of concordances from the five habitat-biased replicates differed from that of the random and gene content-biased trials but was statistically indistinguishable at the 0.05 significance level from that of the five relations-biased replicates. Because a speciation event yields two lineages that are adapted to the same set of niches, recently diverged genomes have a better than random probability of occupying the same habitat. This effect will be attenuated by independent habitat switching and gene loss but may be responsible for the similar distributions seen here.
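For readers unfamiliar with the Kruskal-Wallis test used above, the following sketch computes its H statistic from pooled ranks. It is a textbook form without the tie correction, and it stops short of converting H to a P-value, which requires a chi-square distribution with (number of groups minus 1) degrees of freedom. The function names and the toy input are illustrative only and are not the concordance values from this study.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), no tie correction."""
    pooled = [value for group in groups for value in group]
    total = len(pooled)
    ranks = average_ranks(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (total * (total + 1)) * weighted_rank_sums - 3 * (total + 1)


# Toy data: three completely separated groups of three values each.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 6))  # 7.2
```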
Across all sets of unweighted genome trees, the bootstrap support for each node showed a strong relationship with the length of the branch supporting that node. Figure 7 shows this relationship for three sets of genome trees, corresponding to the five replicates simulated without LGT and the five replicates simulated with either habitat-biased or random LGT and E equal to 250 attempted events per iteration. Although the relationship between branch length and bootstrap support is similar, the trees simulated with random LGT have shorter branches and therefore lower overall bootstrap support. Random LGT leads to a greater average similarity among genomes, compressing the tree and confounding relationships that were clearly resolved in simulations lacking LGT. Conversely, habitat-directed LGT does not produce the same "squashing" of early branches in the tree, and the discordant nodes recovered have longer supporting branches and higher bootstrap support.
Although a single branch migration within a tree can disrupt potentially many bipartitions, multiple SPR operations were needed to reconcile all of the habitat-biased LGT trees (resolved at a bootstrap threshold of 70%) with the true tree of genomes (Table 1). Consequently, the discordance of these trees was not due to a single branch migration but to several such events.
Concordance- and Discordance-Weighted Trees
In separate analyses, the normalized BLASTP scores between proteins from replicate 1 were subjected to concordance and discordance weighting prior to distance matrix reconstruction. Under a concordance-weighting scheme, proteins with ranked similarities to putative orthologs in other genomes will have a high weight if the ranking is similar to the overall ranking of genome similarities. If a consistent pattern of similarities exists for many proteins from a given simulation, then these proteins will exert a strong influence on the overall genome similarity ranking and will make a disproportionately high contribution to the distance matrix as a consequence (Gophna et al., 2005).
The contribution from a subset of proteins with strong adherence to the genome similarity ranking might be expected to yield more well-supported bipartitions than are seen in the unweighted case. Figure 8 shows that this is true for E = 0, 10, and 50: the total number of bipartitions resolved at a BP threshold of 0.7 increased in all nine simulations (Fig. 8a). However, in four out of nine of these trials, there is a decrease in the proportion of total concordant bipartitions (Fig. 8b), so the additional information that emerges from concordance weighting is sometimes in disagreement with the true tree. The effect of concordance weighting when E = 250 is less clear: the number of resolved bipartitions decreases in the unbiased (random LGT) simulation and increases slightly in the other three simulations. In none of the replicates did PDS scores correlate significantly with the number of historical transfers for a given gene: many factors such as paralogy and compositional artifacts are intentionally captured in the PDS weighting process (Clarke et al., 2002), and a simple model of LGT frequency may not be sufficient to gauge its expected impact on phylogenetic discordance.
The effect of discordance weighting at low levels of LGT (E = 0 and E = 10) is a near-complete loss of all strongly supported bipartitions. All five simulations with E ≤ 10 produced trees with a maximum of eight bipartitions with a bootstrap value of 70% or greater: these bipartitions supported only the most recent divergences in the tree, although they were not invariably in agreement with the true tree. A drop in the number of resolved bipartitions relative to the unweighted trees was also seen when E = 50 and E = 250. In seven out of eight of these simulations, 100% of the resolved bipartitions were concordant, with the only exception being the habitat-biased simulation at E = 250.

Discussion
Effectiveness of the Normalized BLASTP Method and Refinements
By including confounding effects such as gene duplications and losses, and changes in the underlying mutational biases of genomes, EvolSimulator produces data sets that can violate the assumptions of many phylogenetic reconstruction methods. Drifts in genomic G+C bias could potentially confound the BLASTP analysis by making distantly related genomes appear more similar to one another if they share similar genomic G+C biases: this effect could potentially be mirrored by compositional or functional convergence in real genomes. In spite of this, trees constructed using the normalized BLASTP method were able to correctly recover (with BP ≥ 0.7) > 60% of the bipartitions in the original simulated tree when LGT was absent from the simulation. The earliest, closely spaced branches in the tree, which describe the relationships between the major groupings of simulated taxa, were weakly supported and generally incorrect. This result provides an interesting contrast with the genome trees of Gophna et al. (2005), where nearly all recovered bipartitions had an associated BP of 100%. It is not uncommon for genomic phylogenies to offer strong support for deep relationships (Wolf et al., 2001), but these relationships may be reflective of compositional biases and unequal rates of evolution, which can overwhelm what little phylogenetic signal remains at great depths (Jermiin et al., 2004). The nonlinearity of the relationship between evolutionary distance and normalized BLASTP distance between pairs of genomes may influence the recovery of correct relationships. This phenomenon also likely contributes to the exaggerated length of recent branches and the consequent squashing seen at the base of the tree in Figure 4, because even a small number of iterations separating a pair of sister genomes will still yield a mean normalized BLASTP distance greater than 0.3. Further refinements to the normalized BLASTP method could include transformations of the normalized BLASTP distance and reweighting of distances between individual pairs of orthologs to yield linear relationships with lower variance. Concordance weighting of normalized BLASTP scores tended to increase the overall statistical support for the recovered genome phylogeny but with the drawback of increasing the support for a few discordant bipartitions to a point above the threshold of significance. In cases where true history and bias (compositional or otherwise) conflict in the relationships they support, concordance weighting will favor one solution over the other. It appears that in most but not all cases in this simulation, the balance favored the true vertical history. Ultimately, it may be worthwhile to investigate the role of bias in detail and examine the combined effects of concordance weighting and accounting for composition using modified evolutionary models (Jayasawal et al., 2007) or residue recoding (Phillips et al., 2004; Susko and Roger, 2007).
In these simulations, discordance weighting did not emphasize major pathways of LGT: in most cases the effect of weighting for discordance was to drastically decrease the statistical support for most relationships in the recovered tree. The most likely reason for this effect is the nature of vertical and lateral relationships in these simulations: whereas there is only one vertical history, in every class of simulation many different types of lateral relationships are expected. This is true even for the habitat-directed scenario, where organisms in our simulations could switch between habitats with frequencies that are likely higher than the extreme cases (e.g., Thermoplasmatales) described in the introductory section and below. In all cases, the resulting lateral relationships would be better represented by a network than by a tree of putative lateral histories.
The Effects of Directed versus Random Exchanges
Two quantities were examined to assess the decay of phylogenomic signal with different rates and scenarios of LGT: the total number of strongly supported bipartitions in the reconstructed genome tree and the proportion of these bipartitions that were in agreement with the true tree. When LGT events preferentially occurred between closely related genomes, the overall effect was to increase the total number of resolved bipartitions; most (but not all) of these gained bipartitions were concordant. These events will decrease the effective time since divergence, with many proteins subjected to orthologous replacement.
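The two quantities described above can be computed from sets of bipartitions, as in the following minimal Python sketch. It assumes that each bipartition is supplied as one side of a split together with a bootstrap proportion (BP) between 0 and 1, and that "strongly supported" means BP of at least 0.7; the function names and the toy data are hypothetical and do not reproduce the scripts used for the analyses reported here.

```python
from typing import Dict, FrozenSet, Iterable, Tuple


def canonical(side: Iterable[str], taxa: Iterable[str]) -> FrozenSet[str]:
    """Represent a split by the side that excludes a fixed reference taxon."""
    all_taxa = frozenset(taxa)
    side_set = frozenset(side)
    reference = min(all_taxa)  # alphabetically first taxon serves as the reference
    return all_taxa - side_set if reference in side_set else side_set


def resolution_and_concordance(
    recovered: Dict[FrozenSet[str], float],
    true_splits: Iterable[Iterable[str]],
    taxa: Iterable[str],
    threshold: float = 0.7,
) -> Tuple[int, float]:
    """Return (number of strongly supported splits, proportion of them found in the true tree)."""
    taxa = frozenset(taxa)
    true_set = {canonical(s, taxa) for s in true_splits}
    strong = {canonical(s, taxa) for s, bp in recovered.items() if bp >= threshold}
    if not strong:
        return 0, float("nan")
    return len(strong), len(strong & true_set) / len(strong)


# Toy example: six taxa, true tree ((A,B),C,(D,(E,F))) written as its internal splits.
taxa = {"A", "B", "C", "D", "E", "F"}
true_splits = [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}]
recovered = {
    frozenset({"C", "D", "E", "F"}): 0.95,  # same split as {A, B}, written from the other side
    frozenset({"C", "D"}): 0.80,            # strongly supported but absent from the true tree
    frozenset({"E", "F"}): 0.55,            # correct but below the support threshold
}
n_strong, proportion_concordant = resolution_and_concordance(recovered, true_splits, taxa)
```

Canonicalizing each split against a reference taxon ensures that a bipartition is matched regardless of which side was written down.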
Although gene content-biased LGT might have been expected to yield results similar to relations-biased LGT, in most cases the degree of resolution and concordance from the content-biased simulations was most similar to that obtained from random LGT. In fact, content-biased LGT and random LGT are equivalent if the gene content is identical among all organisms. Although gene content did vary across organisms in these simulations, the variation appears to have been neither sufficiently large nor consistent across lineages to distinguish the content-biased simulations from the random ones. This observation highlights an interesting distinction between our simulations and the expected empirical case as articulated by the Complexity Hypothesis (Jain et al., 1999): in our simulations, all genes were equally transferable, regardless of their importance to the organism or distribution across the simulated tree of life. The Complexity Hypothesis states that proteins involved in large complexes are resistant to transfer: because most of these proteins are informational and ubiquitous (or nearly so) in living organisms, the most frequently occurring proteins are therefore least likely to be transferred. Reducing the contribution of essential and ubiquitous genes to the calculation of shared gene content would have enhanced the differences among genomes and likely led to LGT scenarios that were more strongly influenced by phylogenetic relatedness and possibly habitat as well.
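The equivalence of content-biased and random LGT under identical gene content can be seen in a minimal Python sketch of donor selection. It assumes, purely for illustration, that the probability of choosing a donor is proportional to the Jaccard similarity between the gene sets of donor and recipient; this weighting and the function names are hypothetical and are not claimed to match the scheme implemented in EvolSimulator.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared gene content as the fraction of genes common to both genomes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def donor_probabilities(recipient: str, genomes: Dict[str, Set[str]]) -> Dict[str, float]:
    """Probability of each other genome acting as donor, proportional to shared content."""
    scores = {
        name: jaccard(genomes[recipient], genes)
        for name, genes in genomes.items()
        if name != recipient
    }
    total = sum(scores.values())
    if total == 0:
        # No shared genes at all: fall back to a uniform (random) choice of donor.
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


# Identical gene content: every donor is equally likely, as under random LGT.
identical = {name: {"g1", "g2", "g3", "g4"} for name in ("X", "Y", "Z")}
uniform = donor_probabilities("X", identical)

# Divergent gene content: the genome sharing more genes is favored as donor.
divergent = {"X": {"g1", "g2", "g3", "g4"}, "Y": {"g1", "g2", "g3", "g4"}, "Z": {"g1", "g2"}}
biased = donor_probabilities("X", divergent)
```

Under this assumed weighting, content-biased donor choice departs from the uniform case only to the extent that gene sets differ among genomes, which is consistent with the similarity of the content-biased and random results observed here.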
Unlike the other classes of simulation examined here, habitat-biased simulations always produced genome trees with features that were strongly supported but discordant with the true tree. Whereas LGT events can have disruptive influences on the recovered genome phylogeny, a random distribution of events should (absent other biases) merely reduce the statistical support for the correct phylogeny. But if gene-sharing events occur preferentially between certain lineages, as is the case with habitat-directed LGT, then the lateral signal may produce a strongly supported alternative to the vertical topology. The ultimate effect will likely not be a clear co-location of frequently exchanging taxa in the recovered genome tree but rather a phylogenetic compromise that is influenced by both the vertical and lateral histories but in fact displays neither.
A survey of published genome phylogenies, supermatrices, and supertrees suggests that many such effects may be influencing recovered trees. An example reported in the genome phylogeny work of Gophna et al. (2005) concerns the positioning of the Thermoplasmatales, which are typically placed near the base of the Archaea in genome phylogenies. It appears that the positioning of Thermoplasmatales reflects a compromise between its vertical history (shared with other Euryarchaeota) and a habitat-directed highway of gene sharing with the thermoacidophilic crenarchaeal genus Sulfolobus. A breakdown of recovered quartets from the phylogenomic analysis of Beiko et al. (2005) identified groups of Thermoplasma proteins with strong affinities for both the hypothetical vertical and lateral histories, with very little support for the reported positioning of this group as basal to the Archaea.
Other groups potentially affected by such vertical/lateral conflicts include the hyperthermophiles Aquifex aeolicus and Thermotoga maritima: there is strong phylogenetic and physiological (see, e.g., Cavalier-Smith, 2002) evidence that A. aeolicus is ancestrally an ε-proteobacterium and T. maritima a low G+C Firmicute. Their frequent positioning together at the base of the tree may be a consequence of the preferential LGT that occurs among thermophilic lineages (including T. maritima with Archaeal groups such as Pyrococcus) in a manner similar to the habitat-biased regimes simulated here. The positioning of these taxa may be further influenced by biased nucleotide and amino acid compositions that affect the results of BLAST searches and phylogenetic analyses. When photosynthetic organisms from different domains are included in phylogenomic analyses, endosymbiotic events may influence the recovered phylogeny (Charlebois et al., 2004; Gophna et al., 2005).
The data sets simulated here could be used to validate other phylogenomic methods as well, including concatenation, supertrees, and parsimony methods such as conditioned reconstruction (Lake and Rivera, 2004). All of these methods are sensitive to phylogenetic incongruence, but some approaches may be better able to extract the historical, "vertical" signal from among a set of discordant relationships (if this is indeed the goal). For instance, supertree approaches may be preferable to sequence concatenation, because orthologous gene sets with weak and potentially misleading phylogenetic signal can be removed from the analysis if the trees they generate have weak statistical support (Bininda-Emonds, 2004). A concatenated alignment would need to consider every site from these genes and might therefore be expected to be more susceptible to "compromise" topologies.
Although the experiments performed in this analysis by no means recapture the entire complexity of microbial evolution, they illustrate the confounding effects of different scenarios of LGT and have important consequences for our ultimate ability to recover a tree (or network) of microbial life. It is perhaps startling that the very high incidence of LGT in some of our simulated data sets (particularly the random LGT simulation with E = 250) does not completely erase the vertical evolutionary signal in the data. The surprising persistence of vertical signal (albeit with relatively greater loss of ancient relationships) supports the idea of laterally transferred genes as "fibers in a rope" (Zhaxybayeva et al., 2004) that carry different individual histories but together constitute a cohesive picture of vertical evolution. Such a picture must depend (as does a supertree) on different (locally untransferred) genes carrying vertical signal in different parts of the tree and demands further elucidation through modeling. However, the introduction of bias into LGT scenarios leads to the conflation of different vertical and lateral signals in the resulting genome phylogeny. The true nature of LGT regimes in the wild will depend on mechanistic and selective factors that influence the probability of success of individual LGT events. If the majority of persistent LGT (as opposed to transient LGT; see, e.g., Hao and Golding, 2006) is confined to close relatives, then the vertical history or "tree of cells" will be reinforced by "xenologous" genes that diverged more recently than the last common ancestor of the cells that contain those genes. Although these genes do not match the reference phylogeny, they can contribute to its recovery and resolution. 
The recovered tree may provide clues to organismal evolution but cannot accurately represent the evolutionary history of genomes, because the genomes have not in fact evolved according to a tree-like process (Doolittle and Bapteste, 2007). Additionally, where LGT occurs in a non-random fashion between distant relatives, our ability to recover such a tree, or even indeed a significantly restricted network, will be severely compromised or lost.
Supplemental Material
The configuration files used in the above simulations and all of the genome trees used in this article are given in the Supplemental Material (http://www.systematicbiology.org). EvolSimulator 2.0.4 can be obtained freely from http://bioinformatics.org.au/evolsim.
Acknowledgment
The authors would like to thank Olaf Bininda-Emonds, James McInerney, James Lake, and Craig Herbold for helpful comments on the manuscript. This work was supported by Australian Research Council Grants DP0342987 and CE0348221 and by the Genome Canada-funded project "Understanding Prokaryotic Genome Evolution and Diversity" (W. F. Doolittle, PI).
References
Altschul S. F., Madden T. L., Schäffer A. A., Zhang J., Zhang Z., Miller W., Lipman D. J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
Auch A. F., Henz S. R., Holland B. R., Göker M. Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences. BMC Bioinformatics (2006) 7:350.
Baldauf S. L., Roger A. J., Wenk-Siefert I., Doolittle W. F. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science (2000) 290:972–977.
Beiko R. G., Charlebois R. L. A simulation test bed for hypotheses of genome evolution. Bioinformatics (2007) 23:825–831.
Beiko R. G., Hamilton N. Phylogenetic identification of lateral genetic transfer events. BMC Evol. Biol. (2006) 6:15.
Beiko R. G., Harlow T. J., Ragan M. A. Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. USA (2005) 102:14332–14337.
Belda E., Moya A., Silva F. J. Genome rearrangement distances and gene order phylogeny in gamma-Proteobacteria. Mol. Biol. Evol. (2005) 22:1456–1467.
Bininda-Emonds O. R. P. The evolution of supertrees. Trends Ecol. Evol. (2004) 19:315–322.
Brochier C., Forterre P., Gribaldo S. Archaeal phylogeny based on proteins of the transcription and translation machineries: Tackling the Methanopyrus kandleri paradox. Genome Biol. (2004) 5:R17.
Cavalier-Smith T. The neomuran origin of Archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification. Int. J. Syst. Evol. Microbiol. (2002) 52:7–76.
Charlebois R. L., Beiko R. G., Ragan M. A. Genome phylogenies. In: Organelles, genomes and eukaryote phylogeny: An evolutionary synthesis in the age of genomics—Hirt R. P., Horner D. S., eds. (2004) Boca Raton, Florida: CRC Press. 189–206.
Clarke G. D. P., Beiko R. G., Ragan M. A., Charlebois R. L. Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J. Bacteriol. (2002) 184:2072–2080.
Creevey C. J., Fitzpatrick D. A., Philip G. K., Kinsella R. J., O'Connell M. J., Pentony M. M., Travers S. A., Wilkinson M., McInerney J. O. Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc. Biol. Sci. (2004) 271:2551–2558.
Daubin V., Gouy M., Perrière G. Bacterial molecular phylogeny using supertree approach. Genome Inform. (2001) 12:155–164.
Desper R., Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comput. Biol. (2002) 9:687–705.
Doolittle W. F., Bapteste E. Pattern pluralism and the Tree of Life hypothesis. Proc. Natl. Acad. Sci. USA (2007) 104:2043–2049.
Dutilh B. E., Huynen M. A., Bruno W. J., Snel B. The consistent phylogenetic signal in genome trees revealed by reducing the impact of noise. J. Mol. Evol. (2004) 58:527–539.
Fitch W. M., Margoliash E. Construction of phylogenetic trees. Science (1967) 155:279–284.
Galtier N. A model of horizontal gene transfer and the bacterial phylogeny problem. Syst. Biol. (2007) 56:633–642.
Gophna U., Doolittle W. F., Charlebois R. L. Weighted genome trees: Refinements and applications. J. Bacteriol. (2005) 187:1305–1316.
Hao W., Golding G. B. The fate of laterally transferred genes: Life in the fast lane to adaptation or death. Genome Res. (2006) 16:636–643.
Holland B. R., Huber K. T., Moulton V., Lockhart P. J. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. (2004) 21:1459–1461.
Hollander M., Wolfe D. A. Nonparametric statistical methods (1999) 2nd edition. New York: John Wiley & Sons.
Jain R., Rivera M. C., Lake J. A. Horizontal gene transfer among genomes: The complexity hypothesis. Proc. Natl. Acad. Sci. USA (1999) 96:3801–3806.
Jayaswal V., Robinson J., Jermiin L. Estimation of phylogeny and invariant sites under the general Markov model of nucleotide sequence evolution. Syst. Biol. (2007) 56:155–162.
Jermiin L., Ho S. Y., Ababneh F., Robinson J., Larkum A. W. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. (2004) 53:638–643.
Kunin V., Goldovsky L., Darzentas N., Ouzounis C. A. The net of life: Reconstructing the microbial phylogenetic network. Genome Res. (2005) 15:954–959.
Lake J. A., Rivera M. C. Deriving the genomic tree of life in the presence of horizontal gene transfer: Conditioned reconstruction. Mol. Biol. Evol. (2004) 21:681–690.
Phillips M. J., Delsuc F., Penny D. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. (2004) 21:1455–1458.
Ragan M. A. On surrogate methods for detecting lateral gene transfer. FEMS Microbiol. Lett. (2001) 201:187–191.
Ragan M. A., Harlow T. J., Beiko R. G. Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol. (2006) 14:4–8.
Sankoff D., Leduc G., Antoine N., Paquin B., Lang B. F., Cedergren R. Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. USA (1992) 89:6575–6579.
Snel B., Bork P., Huynen M. A. Genome phylogeny based on gene content. Nat. Genet. (1999) 21:108–110.
Susko E., Roger A. J. On reduced amino acid alphabets for phylogenetic inference. Mol. Biol. Evol. (2007) 24:2139–2150.
Swofford D. L., Olsen G. J. Phylogeny reconstruction. In: Molecular systematics—Hillis D. M., Moritz C., eds. (1990) Sunderland, Massachusetts: Sinauer Associates. 411–501.
Weisburg W. G., Giovannoni S. J., Woese C. R. The Deinococcus-Thermus phylum and the effect of rRNA composition on phylogenetic tree construction. Syst. Appl. Microbiol. (1989) 11:128–134.
Whelan S., Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. (2001) 18:691–699.
Wilkinson M., Pisani D., Cotton J. A., Corfe I. Measuring support and finding unsupported groups in supertrees. Syst. Biol. (2005) 54:823–831.
Woese C. R., Kandler O., Wheelis M. L. Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. USA (1990) 87:4576–4579.
Wolf Y. I., Rogozin I. B., Grishin N. V., Tatusov R. L., Koonin E. V. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. (2001) 1:8.
Zhaxybayeva O., Gogarten J. P., Charlebois R. L., Doolittle W. F., Papke R. T. Phylogenetic analyses of cyanobacterial genomes: Quantification of horizontal gene transfer events. Genome Res. (2006) 16:1099–1108.
Zhaxybayeva O., Lapierre P., Gogarten J. P. Genome mosaicism and organismal lineages. Trends Genet. (2004) 20:254–260.