© 2005 Society of Systematic Biologists
An Empirical Assessment of Long-Branch Attraction Artefacts in Deep Eukaryotic Phylogenomics
Edited by Marshal Hedin
1 Canadian Institute for Advanced Research, Centre Robert Cedergren, Département de Biochimie, Université de Montréal, Succursale Centre-Ville Montréal, Québec H3C3J7, Canada; E-mail: herve.philippe{at}umontreal.ca (H.P.)
2 School of Biological and Chemical Sciences, Queen Mary, University of London Mile End Road, E1 4NS, London, UK
Abstract
In the context of exponentially growing molecular databases, it becomes increasingly easy to assemble large multigene data sets for phylogenomic studies. The expected increase of resolution due to the reduction of the sampling (stochastic) error is becoming a reality. However, the impact of systematic biases will also become more apparent or even dominant. We have chosen to study the case of the long-branch attraction artefact (LBA) using real instead of simulated sequences. Two fast-evolving eukaryotic lineages, whose evolutionary positions are well established, microsporidia and the nucleomorph of cryptophytes, were chosen as model species. A large data set was assembled (44 species, 133 genes, and 24,294 amino acid positions) and the resulting rooted eukaryotic phylogeny (using a distant archaeal outgroup) is positively misled by an LBA artefact despite the use of a maximum likelihood–based tree reconstruction method with a complex model of sequence evolution. When the fastest evolving proteins from the fast lineages are progressively removed (up to 90%), the bootstrap support for the apparently artefactual basal placement decreases to virtually 0%, and conversely only the expected placement, among all the possible locations of the fast-evolving species, receives increasing support that eventually converges to 100%. The percentage of removal of the fastest evolving proteins constitutes a reliable estimate of the sensitivity of phylogenetic inference to LBA. This protocol confirms that both a rich species sampling (especially the presence of a species that is closely related to the fast-evolving lineage) and a probabilistic method with a complex model are important to overcome the LBA artefact. Finally, we observed that phylogenetic inference methods perform strikingly better with simulated as opposed to real data, and suggest that testing the reliability of phylogenetic inference methods with simulated data leads to overconfidence in their performance.
Although phylogenomic studies can be affected by systematic biases, the possibility of discarding a large amount of data containing most of the nonphylogenetic signal allows recovering a phylogeny that is less affected by systematic biases, while maintaining a high statistical support.
Keywords: Distant outgroup; eukaryotic tree; long-branch attraction; microsporidia; multigene data sets; nucleomorph; rooting; species sampling; systematic biases
Received November 5, 2004; Revised February 7, 2005; Accepted April 30, 2005
Single-gene phylogenies are generally poorly resolved because the number of informative positions is limited and stochastic (random) noise yields contradictory, yet often poorly supported, results. Phylogenomics, that is the use of a large number of genes, or ultimately of complete genomes, in phylogenetic inference, is of great promise to overcome stochastic errors and to furnish statistically significant results. Recently, the analysis of several large data sets has allowed enhanced insight into long-term outstanding questions such as relationships of placental mammals (Madsen et al., 2001; Murphy et al., 2001) and angiosperms (Qiu et al., 1999; Soltis et al., 1999). However, conflicting results have also emerged. For example, the monophyly of Ecdysozoa (nematodes + arthropods) is strongly rejected by some phylogenomic analyses (Blair et al., 2002; Philip et al., 2005; Wolf et al., 2004) and strongly supported by others (Delsuc et al., 2005; Philippe et al., 2005).
The use of large data sets reduces the impact of the stochastic error (which will disappear only with infinite samples); however, it can exacerbate systematic errors, which can eventually become dominant. Systematic errors occur when the real evolutionary process differs from our oversimplified models (Phillips et al., 2004). They may also be found in the case of single genes, but are usually hidden by sampling errors. Although probabilistic methods like maximum likelihood (ML) or Bayesian approaches are known to be more robust to model violations (Hasegawa and Fujiwara, 1993; Sullivan and Swofford, 2001), heterotachy (the heterogeneity of the evolutionary rate of a given position through time) and compositional bias can lead to inconsistency (Foster and Hickey, 1999; Inagaki et al., 2004; Kolaczkowski and Thornton, 2004; Lockhart et al., 1996; Philippe and Germot, 2000). For example, the minimum evolution method is inconsistent in the case of the large yeast data set of Rokas et al. (2003) because two unrelated species share a similar nucleotide composition. This can be corrected, however, by RY coding (Phillips et al., 2004).
Variable evolutionary rates among lineages constitute an important source of systematic bias. The long-branch attraction (LBA) artefact posits that the two longest branches will cluster together under certain conditions, irrespective of the true relationships of the sequences under study (Felsenstein, 1978). In the case of a distant outgroup (representing a long branch), LBA leads to the artefactual early emergence of the fast-evolving lineages of the ingroup (Philippe and Laurent, 1998). Although LBA artefacts were suspected to be present in various phylogenies (Bapteste et al., 2002; Dacks et al., 2002; Huelsenbeck, 1997; Nozaki et al., 2003; Qiu et al., 2001; Sanderson et al., 2000; Simpson et al., 2002; Stiller and Hall, 1999), they are difficult to discover and overcome (see the case of glires, Douzery et al., 2004). The most obvious way would be the use of a tree reconstruction method that is not sensitive to this artefact, but, unfortunately such a method does not yet exist. Probabilistic methods fail because the current models (even the most complex ones) do not reflect all facets of biological reality, not because of the method per se (Felsenstein, 2004; Lockhart et al., 1996). Simulation studies (Guindon and Gascuel, 2003; Huelsenbeck, 1998; Kuhner and Felsenstein, 1994; Qiu et al., 2001; Swofford et al., 2001; Wolf et al., 2004) have revealed that maximum parsimony (MP) is generally more sensitive than distance-based methods, whereas probabilistic methods are generally more robust. The different sensitivity of MP and probabilistic methods can help to detect if the LBA artefact is playing a major role (Germot et al., 1997; Huelsenbeck, 1998).
However, if all methods yield trees where long branches, such as fast-evolving species and outgroup, are clustered, the situation becomes much more complex. One possibility is to modify the taxonomic sampling so that only the slowest evolving species are included (Aguinaldo et al., 1997). Alternatively, the addition of species can alleviate the LBA artefact by dividing long internal branches (Hendy and Penny, 1989). In this case, the addition of slowly evolving species is much more efficient, whereas the addition of fast-evolving species makes things worse (Kim, 1996; Poe, 2003). Although the most efficient conditions of species addition are not known (Hillis et al., 2003; Rosenberg and Kumar, 2003), several cases of LBA were revealed by adding species (Anderson and Swofford, 2004; Dacks et al., 2001; Inagaki et al., 2004; Philippe, 1997).
Finally, when the species sampling is reasonable for a given phylogenetic problem, the removal of sequence positions can be an effective method. The fast-evolving positions, which are saturated by multiple substitutions, have lost much, if not all, of their phylogenetic signal and are especially sensitive to any systematic bias. The slow/fast (SF) method (Brinkmann and Philippe, 1999), which starts by selecting the slowest evolving positions, and then progressively adding faster evolving positions, can reveal a transition between a topology in which the long branches are not grouped and a topology dictated by the LBA artefact (Brinkmann and Philippe, 1999; Brochier and Philippe, 2002; Busse and Preisfeld, 2003; Delsuc et al., 2005; Hampl et al., 2004; Philippe et al., 2000b).
Rooting deep level phylogenies is of fundamental importance in understanding the origin of numerous groups, eukaryotes in particular (Forterre and Philippe, 1999; Lake and Rivera, 1994; Lopez-Garcia and Moreira, 1999; Martin and Müller, 1998; Poole et al., 1999). Because many groups only have a distantly related outgroup (e.g., marsupials versus placental mammals, gnetales/gymnosperms versus angiosperms, Archaea versus eukaryotes), the probability of the erroneous early emergence of fast-evolving lineages is high when multiple genes are used. One can therefore legitimately ask: is it possible to confidently root deep level trees in a phylogenomic analysis, or in other words, to eschew the LBA artefact in the presence of a distant outgroup?
In this article, we tackle this question by studying a situation in which the phylogenetic position of two fast-evolving lineages is well-established a priori. We selected the eukaryotic phylogeny (Fig. 1) because the archaeal sequences represent a distantly related outgroup that should strongly attract any fast-evolving eukaryotes. Two fast-evolving eukaryotes, the nucleomorph of the cryptophyte Guillardia theta, and the microsporidium Encephalitozoon cuniculi, were selected because their complete genomes had been sequenced (Douglas et al., 2001; Katinka et al., 2001). The nucleomorph originated in a secondary endosymbiotic event in which an entire red alga was engulfed by a flagellate host cell, and corresponds therefore to the remnant of the former red algal nucleus, which is now highly reduced. This interpretation is supported by phylogenetic data from the corresponding chloroplast genome (Douglas et al., 2001; Yoon et al., 2002) and by morphological characters (Gibbs, 1981). The position of microsporidia has been more controversial, but now a large body of evidence argues that microsporidia are closely related to fungi (Keeling and Fast, 2002), although their exact position within fungi remains uncertain (Keeling, 2003). To include the chytridiomycetes, an important group of fungi, we sequenced 1000 ESTs from Neocallimastix patriciarum. N. patriciarum is an anaerobic fungus that can be found in the digestive tract of herbivorous mammals, in both ruminants and nonruminants (Teunissen and Op den Camp, 1993). Interestingly, this organism does not possess classical aerobic mitochondria, but rather hydrogen-producing organelles called hydrogenosomes. Hydrogenosomes are modified mitochondria that completely lost their genome and respiratory functions (reviewed in Embley et al., 2003). The group chytridiomycetes, to which this organism belongs, is characterized by the presence of a flagellum, a unique property within fungi. For this reason, it is generally assumed that chytridiomycetes have a basal position within fungi (James et al., 2000). We assembled a large data set of 133 nuclear encoded genes from six archaeal outgroup and 33 slow- and 2 fast-evolving eukaryotic ingroup species, including the microsporidium Encephalitozoon cuniculi and the nucleomorph of the cryptophyte Guillardia theta. Because the two fast-evolving species were misplaced in preliminary analyses, four different approaches were used to study LBA artefacts: (1) the removal of the fastest evolving proteins, (2) the use of various tree reconstruction methods, (3) the use of diverse taxon samplings, and (4) phylogenetic inference without the distant outgroup.
Materials and Methods
Neocallimastix ESTs
Sequences were obtained from a previously constructed Neocallimastix patriciarum ZAP II cDNA library (Xue et al., 1992). An aliquot of this library containing a random collection of clones was excised by superinfection with helper phages according to the manufacturer's instructions (Stratagene). One thousand clones were randomly selected and subsequently analyzed by sequencing. A detailed description of the sequences will be provided elsewhere.
Assembling the Alignment
We added to the aligned data sets of 174 proteins used by Philippe et al. (2004) the amino acid sequences available in GenBank (nonredundant section) in December 2003, using a BLASTP search with a cutoff e-value corresponding to the highest value of the orthologous proteins in Archaea. We then added to the alignments the EST sequences from the chytridiomycete N. patriciarum, and EST, as well as genomic sequences, from several ongoing sequencing projects. We retrieved most of the sequences from GenBank through NCBI (http://www.ncbi.nlm.nih.gov) except for Cryptococcus neoformans (C. neoformans cDNA Sequencing Project at http://www.genome.ou.edu/cneo.html; and C. neoformans Genome Project, Stanford Genome Technology Center and the Institute for Genomic Research, at http://baggage.stanford.edu/group/C.neoformans/download.html), Dictyostelium discoideum (Genome Sequencing Center Jena website at http://genome.ibm-jena.de/dictyostelium), Thalassiosira pseudonana (http://genome.jgi-psf.org/thaps1/thaps1.download.ftp.html), Phytophthora sojae (http://genome.jgi-psf.org/sojae1/sojae1.download.ftp.html), Tetrahymena thermophila (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/t_thermophila/), and Monosiga brevicollis (http://projects.bocklabs.wisc.edu/carroll/choano/, King et al., 2003).
The sequences were added as described in Philippe et al. (2004). To deal with the problem of nonorthologous sequences, we constructed amino acid based phylogenies (MP and ML) starting with the original 174 proteins, of which the 133 proteins used to assemble our final phylogenomic data set represent a conservative subsample. At this step we also eliminated all proteins that had either too few species or too much missing data. The reliability of orthology assignment was greatly improved due to the use of numerous species. Genes for which orthology relationships were difficult to establish (e.g., EF-1α or cytosolic HSP70) were completely discarded from the analyses. When recent gene duplications were detected (almost exclusively for vertebrates), the slowest evolving gene copy was selected. We did not find in our individual gene data sets any case in which horizontal gene transfers would provide a reasonable explanation.
To assemble a data set rich in both species and genes, sequences can be missing or partial for some proteins from some species, because we compiled the sequences mainly from cDNA sequencing projects. To decrease the amount of missing data, we created chimerical sequences between closely related taxa (see Appendix 1, available at www.systematicbiology.org). We retained only species for which a sufficiently large number of amino acid residues were available (larger than 5000). Simulation studies have shown that under these conditions the impact of missing data is negligible (Philippe et al., 2004; Wiens, 2003). Moreover, the removal of the most incomplete taxa has no visible effect on the phylogenetic inference (Philippe et al., 2005).
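The species filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the alignment is represented as a plain dictionary of concatenated sequences, and both `?` and `-` are treated as missing, which the text does not specify.

```python
# Assumption: '?' and '-' both count as missing data in the concatenated alignment.
MISSING_SYMBOLS = frozenset("?-")


def known_residues(sequence):
    """Count the amino acid residues that are actually present in a sequence."""
    return sum(1 for ch in sequence if ch not in MISSING_SYMBOLS)


def percent_missing(sequence):
    """Percentage of positions that are missing for one species."""
    if not sequence:
        raise ValueError("empty sequence")
    return 100.0 * (len(sequence) - known_residues(sequence)) / len(sequence)


def retain_species(alignment, min_residues=5000):
    """Keep only species with more than min_residues known residues.

    alignment maps a species name to its concatenated sequence (hypothetical layout).
    """
    return {
        name: seq
        for name, seq in alignment.items()
        if known_residues(seq) > min_residues
    }
```

With the default threshold, a species is kept only if it has more than 5000 known residues, mirroring the criterion stated in the text.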
In order to extract only unambiguously aligned portions and to eliminate divergent regions of the alignment, we used Gblocks (Castresana, 2000) with the following parameter settings: a minimum of 50% of the sequences per position identical for a conserved position, a minimum of 75% of the sequences identical for a flanking position, a maximum of five contiguous nonconserved positions, and a minimum of five positions for a block. This selection was manually verified; in particular, a few conserved regions with some amount of missing data, for which Gblocks was too stringent, were reintroduced into the data set. A data set comprising 44 species (six Archaea, 33 slowly evolving eukaryotes, a microsporidium, a nucleomorph, and three kinetoplastids) and 133 genes (displaying a mean of 24% of missing data per species; Appendix 2, available at www.systematicbiology.org) was constructed. In a few cases the amount of missing data is quite high, with a maximum of 80% for the brown alga Laminaria. However, there are only seven species with fewer than 10,000 amino acid positions, and they are always closely related to almost complete species, so that no major eukaryotic lineage is represented only by highly incomplete taxa. The alignments are available upon request and nexus files of the two basic data sets (including two trees each; expected and LBA) were also submitted to TREEBASE under the study accession number SN2312.
Phylogenetic Analyses
Phylogenetic analyses were performed at the amino acid level. Various models of sequence evolution were considered. We used Poisson (the same probability for all pairs), WAG (Whelan and Goldman, 2001), or JTT (Jones et al., 1992) amino acid replacement matrices with and without gamma-distributed rates across sites (Yang, 1993). Two different models were applied: (1) the separate model (Yang, 1996) where branch lengths and the α parameter are free to vary for all genes, and (2) the concatenated model that considers all genes as a "super-gene."
Two important limitations for finding the best tree become prominent when a large number of positions and a large number of species are used: (1) pronounced local minima and (2) computing time and memory requirements. The height of the potential barriers separating local minima increases with the number of positions used (Salter, 2001). The probability that the heuristic search is trapped in a local minimum is therefore much higher. As a consequence, we used mainly exhaustive tree searches for the ML analyses. Since the number of possible topologies is too large for an exhaustive search (10^53 for 39 species), we proceeded in two steps.
First, for the data set comprising the 33 slowly evolving eukaryotic species and the six Archaea, several heuristic searches were performed. The methods used were MP implemented in PAUP* (Swofford, 2000), ML using PHYML (Guindon and Gascuel, 2003) with a concatenated JTT+F+Γ4 model, and Bayesian inference in MrBayes (Ronquist and Huelsenbeck, 2003) with a concatenated WAG+F+Γ4 model (150,000 generations, burn in of 14,500 generations, 4 chains). The parameter F (frequency) corresponds to the use, as equilibrium frequencies, of the amino acid frequencies observed in the data sets under study, instead of the ones obtained for the original data set used to infer the amino acid replacement matrix (WAG or JTT). The high memory requirements of the probabilistic analyses based on the concatenated data sets limited the modeling of among site rate variation to the use of four discrete gamma categories (Γ4). Distance methods were not used to infer trees, because they are sensitive to the presence of missing data in the alignment. All MP analyses were performed without constrained trees and applied the following options: heuristic search with TBR, 10 random species additions, and 1000 bootstrap replicates. All Bayesian inferences were performed three times independently and always converged towards the same posterior distributions. In the PHYML analyses, the starting tree was obtained using ML-based distance estimates and the algorithm BIONJ (Gascuel, 1997); the ML tree was subsequently obtained by nearest neighbor interchange (NNI). Given the high number of positions, most of the nodes were, as expected, highly supported by all methods and were thus constrained in the subsequent analyses. Only the relationships among the six main eukaryotic lineages and among the four main fungal lineages were left unconstrained (Appendix 4, available at www.systematicbiology.org). These constraints define 14,175 topologies, which were analyzed with a concatenated JTT+F model by PROTML (Adachi and Hasegawa, 1996b). We then retained the 1000 best topologies for further analyses, as in Bapteste et al. (2002) and Philippe et al. (2004). These topologies were analyzed with a separate WAG+F+Γ model with the program Tree-Puzzle (Schmidt et al., 2002).
Second, we tried to locate the three fast-evolving lineages one at a time, namely the microsporidium Encephalitozoon, the nucleomorph of the cryptophyte Guillardia theta, and three kinetoplastids (Leishmania major, Trypanosoma brucei, and T. cruzi). Their possible locations in the phylogeny were analyzed exhaustively by adding them to all 75 branches of the 39 species tree (six Archaea and the 33 slowly evolving eukaryotes). However, because the topology of this tree is not known with certainty, we retained the 25 best topologies obtained with a separate WAG+F+Γ model. At first sight, 25 topologies may seem to be a small number compared to the 10^53 possible topologies. However, the two best topologies together received 99% of the RELL bootstrap support and the 26th topology is less likely than the best one by ten orders of magnitude (ΔlnL = 221). A total of 1875 different topologies (25 × 75) was thus analyzed to locate each fast-evolving lineage.
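The topology counts quoted in this two-step procedure can be checked with the standard formulas for unrooted, fully bifurcating trees: (2n − 5)!! topologies and 2n − 3 branches for n taxa. The short Python sketch below is illustrative only; the function names are hypothetical, and the decomposition of 14,175 as 945 × 15 (seven leaves for the six eukaryotic lineages plus the outgroup, five leaves for the four fungal lineages plus the rest of the tree) is our reading of the constraints, not something stated in the text.

```python
def num_unrooted_topologies(n_taxa):
    """Number of unrooted bifurcating topologies for n_taxa leaves: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= k
    return count


def num_branches(n_taxa):
    """Number of branches (internal plus terminal) in an unrooted bifurcating tree."""
    return 2 * n_taxa - 3


# 39 species: a 54-digit number, i.e. on the order of 10^53 topologies.
print(len(str(num_unrooted_topologies(39))))

# 75 insertion points for one extra lineage, times 25 retained backbone topologies.
print(num_branches(39), 25 * num_branches(39))

# Assumed decomposition of the constrained search space (see lead-in).
print(num_unrooted_topologies(7) * num_unrooted_topologies(5))
```

Under these formulas, adding a lineage to every branch of each of the 25 retained backbone trees gives the 1875 topologies mentioned above.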
Because the computation of bootstrap values is the most demanding task, we used the RELL method (Kishino et al., 1990). More precisely, the likelihood values of each tree for each gene and the corresponding branch lengths were computed using Tree-Puzzle. The likelihood of each position for each tree was then computed using CodeML of the PAML package (Yang, 1997b). The site-wise likelihood values were used by a home-made program to compute the RELL bootstrap values of each topology based on 1000 replicas. The bootstrap values (BVs) for the placement of the fast-evolving lineages should not be underestimated by the RELL procedure, since, despite the fact that we analyzed only 1875 (25 x 75) topologies, all possible positions of the fast lineages in the tree were studied. This approach allowed us to perform all computations in a reasonable time (about 3 months on a cluster with 30 Xeon 2.8 GHz processors).
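As an illustration of the RELL idea (resampling estimated log-likelihoods instead of re-optimizing each bootstrap replicate), a minimal Python sketch follows. It is not the program used in this study: the function name, the input layout, and the choice to resample sites across the whole alignment rather than within each gene are assumptions made for the example.

```python
import random


def rell_bootstrap(site_lnl, n_replicates=1000, seed=0):
    """RELL bootstrap support for a fixed set of topologies.

    site_lnl[t][s] is the log-likelihood of site s under topology t
    (hypothetical layout). Returns, for each topology, the fraction of
    replicates in which it has the highest resampled log-likelihood.
    """
    rng = random.Random(seed)
    n_trees = len(site_lnl)
    n_sites = len(site_lnl[0])
    wins = [0] * n_trees
    for _ in range(n_replicates):
        # Draw sites with replacement; the same draw is applied to every topology.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = [sum(row[s] for s in sample) for row in site_lnl]
        best = max(range(n_trees), key=totals.__getitem__)
        wins[best] += 1
    return [w / n_replicates for w in wins]
```

Because the site-wise likelihoods are computed once and only re-summed, each replicate costs a few additions per topology, which is what makes 1000 replicates over 1875 topologies tractable.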
The fit of models to data was evaluated using the Akaike Information Criterion (AIC) (Akaike, 1973). According to Burnham and Anderson (2003), a delta AIC value greater than 10 means that the competing model receives no support. Tree comparisons were performed using the approximate unbiased (AU) and the Shimodaira-Hasegawa (SH) (Shimodaira and Hasegawa, 1999) tests as implemented in the program CONSEL (Shimodaira and Hasegawa, 2001).
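For reference, the AIC of a model with k free parameters and maximized log-likelihood lnL is AIC = 2k − 2 lnL, and ΔAIC is the difference from the best (lowest) AIC among the competing models. A small Python sketch follows; the function names and the example values are hypothetical and are not taken from this study.

```python
def aic(ln_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * ln_likelihood


def delta_aic(models):
    """Map each model name to its AIC minus the smallest AIC.

    models maps a name to a (lnL, number_of_parameters) pair.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}


# Hypothetical values, for illustration only.
example = {"separate": (-1000.0, 10), "concatenated": (-1020.0, 5)}
print(delta_aic(example))
```

In this toy case the model with more parameters is still preferred, because its gain in log-likelihood outweighs the penalty; a ΔAIC above 10 for the other model corresponds to the "no support" threshold cited above.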
Removal of Fast-Evolving Proteins
To test whether LBA affects phylogenetic inference, we devised a method coined Removal of Fast-evolving Proteins (RFP). The fastest evolving proteins were detected and selectively eliminated in a protein specific way (Fig. 2). The distances were estimated by ML using the program Tree-Puzzle with the same model as for ML tree inference. They were calculated for the concatenation of all proteins as well as separately for each protein. The mean distances between the Archaea and the fast evolving eukaryotic lineages under study (like Encephalitozoon) were then calculated for both the concatenation and each of the proteins. Thereafter, the genes were sorted according to the quotient obtained by the following formula: [dmean,gene (Fast,Archaea)]/[dmean,concat (Fast,Archaea)]. The greater the value, the faster the evolutionary rate for this protein in comparison to the mean value obtained for all concatenated proteins. As shown in Figure 2, the fastest evolving proteins from the fast-evolving lineage were selectively eliminated for a given protein and replaced by question marks, the sequences of all other species remaining unchanged. This selective elimination of proteins was performed by steps of 10%, up to 90%.
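The ranking and masking steps of RFP can be sketched as follows. This is a minimal Python illustration, not the code used in the study: the function names, the dictionary layout of the alignment, and the rounding of the number of genes removed at each step are assumptions; the distances themselves are taken as given (in the study they are ML estimates from Tree-Puzzle).

```python
def rank_genes_by_relative_rate(gene_dist, concat_dist):
    """Sort genes from fastest to slowest in the fast lineage.

    gene_dist maps a gene to the mean distance between the fast lineage and
    Archaea for that gene; concat_dist is the same mean distance for the
    concatenation. The sort key is the quotient d_gene / d_concat.
    """
    ratios = {gene: d / concat_dist for gene, d in gene_dist.items()}
    return sorted(ratios, key=ratios.get, reverse=True)


def genes_to_mask(ranked_genes, percent):
    """Return the fastest `percent` of genes (rounding is an assumption)."""
    n_removed = round(len(ranked_genes) * percent / 100)
    return ranked_genes[:n_removed]


def mask_fast_lineage(alignment, fast_taxon, genes):
    """Replace the fast taxon's sequence by '?' in the listed genes only.

    alignment maps gene -> {taxon: sequence}; other taxa are left unchanged
    and the input is not modified.
    """
    masked = {gene: dict(seqs) for gene, seqs in alignment.items()}
    for gene in genes:
        if fast_taxon in masked.get(gene, {}):
            masked[gene][fast_taxon] = "?" * len(masked[gene][fast_taxon])
    return masked
```

Calling `genes_to_mask(ranked, p)` for p = 10, 20, ..., 90 and masking the returned genes would give the nine reduced data sets analyzed at each removal step.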
The RFP method does not assume an a priori knowledge of the "correct" phylogeny and is therefore topology independent. We remove up to 90% of the fastest evolving proteins (a limit that allows conserving sufficient phylogenetic information). The topology may change as a function of protein elimination or remain the same. We chose cases in which we expect that a certain change will eventually occur; however, this is mainly a control. The only a priori knowledge required by the RFP method is the nature of the outgroup. Here, Archaea are a fairly undisputed outgroup of eukaryotes.
Simulation Studies
We generated 100 matrices of 40 taxa and 24,294 amino acid positions under PSeq-Gen (Grassly et al., 1997) using the model topology shown in Figure 1, except that the nucleomorph was not considered. A separate model was used for simulations. More precisely, empirical amino acid frequencies, alpha parameter, and branch lengths were estimated for each protein separately. Then, for each protein, sequences of the size of this protein were simulated using the protein-specific parameters. The phylogenies were then inferred using the same protocol as for real data. With MP, heuristic search with 10 random species additions and TBR swapping was performed. With ML, all positions of the fast-evolving lineage were considered, but only the 10 best topologies connecting the 33 slow-evolving species, instead of 25, were retained, for computing time reasons. Simulation studies were also performed using a concatenated model, and the results were virtually identical to the separate model (data not shown). It should be noted that for the species rich data sets (32 taxa or more), only 10 replicates were analyzed with ML because of computing time limitations. However, because we obtained 100% for all 10 replicates, it is unlikely that the analysis of more replicates will fundamentally change the results.
Results and Discussion
Removal of the Fastest Evolving Microsporidial and Nucleomorph Proteins
To simplify the study, the two fast-evolving species were analyzed separately. Beginning with microsporidia, a ML tree based on 133 genes (24,294 positions) inferred using either a separate WAG+F+Γ model or a concatenated JTT+F+Γ model is shown in Figure 3. The tree is in excellent agreement with previous studies of eukaryotic phylogeny (Baldauf et al., 2000; Philippe et al., 2004). In particular, the monophyly of all major phyla, for example Fungi, Metazoa plus Choanoflagellata (Holozoa), Conosa, green plants, stramenopiles, and Apicomplexa, is recovered. Moreover, the monophyly of Opisthokonta (Fungi + Holozoa), Alveolata (Apicomplexa + ciliates), and Plantae (red algae + green plants) is found. However, the monophyly of Chromalveolata (alveolates and stramenopiles) (Cavalier-Smith, 2000; Fast et al., 2001) is not recovered. Within fungi, the grouping of ascomycetes and basidiomycetes, to the exclusion of chytridiomycetes and glomales, is supported by a bootstrap value (BV) of 100%. The early emergence of chytridiomycetes, until now only confirmed by a multigene phylogeny based on the mitochondrial genome (Bullerwell et al., 2003), is recovered, but not significantly supported. BVs are 86% and 66% for the separate and the concatenated analyses, respectively.
The microsporidium Encephalitozoon emerges at the base of eukaryotes with a high support (BV around 100%). An LBA artefact between the distantly related Archaea and the fast-evolving microsporidium likely explains this result. In fact, systematic biases constitute a serious issue when large data sets are used, even with a ML method and a reasonable species sampling (Philippe et al., 2005). However, the 133 genes of our data set do not all evolve at the same evolutionary rate in the microsporidial lineage. Therefore, in an attempt to overcome systematic biases, we assumed that the proteins that evolved the most slowly in microsporidia display a higher phylogenetic/nonphylogenetic signal ratio. We used the RFP method, which progressively eliminates the fastest evolving proteins for microsporidia, and studied the effect on phylogenetic inference (see Fig. 2 and Materials and Methods for a detailed description). Only proteins of the fast-evolving species were removed, in order to maintain a large data set, given the difficulty in resolving the eukaryotic phylogeny with significant support (Philippe et al., 2000a; see Appendix 5 for the list of genes eliminated, available at www.systematicbiology.org).
As shown in Figure 4A, the application of the RFP method has a profound impact on the phylogenetic position of microsporidia. The removal of 50% of the fastest microsporidial proteins leads to a slight decrease of the BV for the early emergence of this group (from 97% to 78%). The removal of more proteins decreases these BVs much more rapidly, converging to 0% for a removal of 80% and 90%. This decrease could be simply due to the fact that too many proteins are removed and no phylogenetic signal remains. However, BVs for the grouping of microsporidia with fungi show exactly the complementary trend, eventually converging to 100%. More precisely, the sum of the BVs for these two alternative positions of microsporidia (at the base of eukaryotes or with fungi) is always 100%. Therefore, our analysis strongly suggests that only two mutually exclusive signals exist for microsporidia: a nonphylogenetic signal due to LBA pulling them towards Archaea, and a genuine phylogenetic signal attracting them towards fungi. It should be noted that both signals are strong. For example, with only 10% of the microsporidial proteins remaining (3709 positions), the grouping with fungi is supported by a BV of 100%. Even with a probabilistic tree reconstruction method using a complex model and a reasonable taxonomic sampling, it is necessary to remove an important fraction of the proteins, corresponding to the noisiest data, in order to avoid the LBA artefact. Interestingly, this also allows recovery of the expected phylogeny.
We also applied the RFP method in the case of the nucleomorph (Fig. 4B). Exactly the same tendency is observed: the support for the apparently artefactual position (nucleomorph at the base of eukaryotes) decreases with sequence removal. Nevertheless, analysis of the complete dataset recovers the expected position of the nucleomorph (sister-group of red algae), but only with a BV of 58%. The support for this position rises to 95% at the removal of only 60% of the fast evolving nucleomorph proteins. The increase continues to a BV of 99% when additional proteins are removed. The difference between Figures 4A and 4B suggests that either the genuine phylogenetic signal is higher for nucleomorph than for microsporidia or the nonphylogenetic signal due to LBA is lower. Wiens (1998) shows that missing data may enhance the LBA artefact, because this mimics poor species sampling. However, our study shows that increasing the amount of missing data up to 90% allows the reduction of the LBA artefact, simply because the proteins that evolved the fastest in the lineage affected by the LBA have been removed. The relationships between LBA and missing data are thus complex and deserve further studies. Very recently, by using simulations, Wiens (2005) demonstrates the ability of incomplete taxa to reduce LBA when they break the long branches, in particular for model-based methods.
Relative Efficiency of Diverse Tree Reconstruction Methods
In order to evaluate the sensitivity of various tree reconstruction methods to the LBA artefact, we applied MP and ML methods to both the microsporidium (Fig. 5A) and the nucleomorph (Fig. 5B) data sets. In the case of the ML method, we compared the efficiency of models that deal with three kinds of heterogeneity in the evolutionary process: (1) the heterogeneity of amino acid replacement rates by comparing the Poisson, which assumes that all substitutions are equally likely, and the WAG replacement matrices (Whelan and Goldman, 2001); (2) the heterogeneity of replacement rates among positions (uniform or Γ-distributed rates); (3) the heterogeneity of evolutionary rates between genes and species by comparing a concatenated model and a separate model that allows branch lengths and alpha parameter to vary from gene to gene (Yang, 1996). The evaluation of the relative efficiency is straightforward based on Figure 5: the better a given tree reconstruction method, the sooner (with a lower number of removed proteins) it will allow the recovery of a phylogeny not affected by the LBA artefact.
The only nonprobabilistic method applied, the MP method, performed poorly in both cases, with a BV of 0% for the expected solution (Fig. 5) for all data sets up to 80% of protein removal. The BVs were different from 0% (up to 6% for microsporidia) only when 90% of the proteins were removed. The ML method with a simple and unrealistic model (separate Poisson+F without gamma) performs much better, recovering for example the monophyly of fungi + microsporidia with a BV of 94% when 90% of the fast proteins are removed. These results, obtained with real sequences, confirm previous results based on simulations (Anderson and Swofford, 2004; Huelsenbeck, 1998; Kuhner and Felsenstein, 1994; Qiu et al., 2001; Swofford et al., 2001). When some of the lineages evolve at markedly different rates, the use of probabilistic methods should be preferred over MP. A recent study (Kolaczkowski and Thornton, 2004) has demonstrated that MP outperforms ML when the level of heterotachy is extreme. However, this conclusion was based on simulation studies assuming a molecular clock, and it does not hold when evolutionary rates vary considerably among lineages (unpublished results).
Considering the models of amino acid replacement, the Poisson model appears to be always less efficient than the WAG model (Fig. 5). For example, in the case of the nucleomorph with a Γ distribution, it is necessary to remove 90% of the nucleomorph proteins to obtain a BV of 95% with a Poisson model, whereas the same BV is obtained through the removal of only 60% of the proteins with the WAG model (Fig. 5B). Taking the among-site rate variation into account by the use of a Γ distribution is also much more efficient against the LBA artefact both under Poisson and WAG matrices. These results demonstrate that ignoring the heterogeneity of the evolutionary process (for amino acid replacements and among positions) drastically reduces the accuracy of ML-based tree reconstruction methods.
Allowing for the possibility that different species evolve at different rates for different proteins produced less clear-cut results. For example, in the case of microsporidia, the concatenated WAG+F model is more sensitive to LBA than the separate WAG+F model, its performance being similar to that of the separate Poisson+F model (Fig. 5A). However, when a Γ distribution is used, the concatenated and the separate models have similar efficiency. Indeed, in the case of the nucleomorph and a WAG+F+Γ model, the concatenated analysis performs slightly better than the separate model, except when more than 80% of the proteins were removed.
Fit of the Model to the Data and Phylogenetic Accuracy
Because systematic errors occur when simplified models of sequence evolution used by the ML method are in conflict with the real evolutionary process, we evaluated how well the various models fit the data. We computed the AIC of each model for the nucleomorph data set (Table 1); the results are virtually identical for microsporidia (data not shown). As expected, the Poisson amino acid replacement matrix performs more poorly than JTT and WAG, whereas the WAG matrix has a slightly better fit to the data than JTT. The gamma distribution also greatly improves the fit of the model to the data (e.g., with the separate model, lnL = –744,406 for WAG+F and lnL = –715,969 for WAG+F+Γ). Despite a serious increase in the number of parameters (12,804 additional parameters), the separate model has a better fit than the concatenated model (Table 1), according to the AIC. Therefore, taking into account the heterogeneity in the evolutionary process always improves the fit of the model to the data, albeit to noticeably different extents.
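For reference, the AIC trades goodness of fit against the number of free parameters, AIC = 2k − 2 lnL, and the model with the lower value is preferred. The sketch below applies this to the two log-likelihoods quoted above for the separate model. It assumes that adding a gamma distribution costs one extra shape parameter per gene (133 in total); that count is an assumption made for illustration and is not taken from Table 1.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(lnl_simple, lnl_complex, extra_params):
    """AIC(complex) - AIC(simple); negative values favour the complex model.

    Parameters shared by both models cancel, so only the number of extra
    parameters in the complex model is needed.
    """
    return aic(lnl_complex, extra_params) - aic(lnl_simple, 0)


# Log-likelihoods quoted in the text for the separate model.
LNL_WAG_F = -744_406
LNL_WAG_F_GAMMA = -715_969
# Assumption: one additional alpha parameter per gene (133 genes).
EXTRA_PARAMS = 133

difference = delta_aic(LNL_WAG_F, LNL_WAG_F_GAMMA, EXTRA_PARAMS)
print(difference)  # -56608
```

Under this assumption the gain in log-likelihood outweighs the parameter penalty by several orders of magnitude, so the preference for the gamma model would not be sensitive to the exact parameter count.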
The comparison of Table 1 and of Figure 5 confirms the hypothesis that using better models generally produces better phylogenies; in other words, that model misspecifications are the reason for the inconsistency of ML approaches. However, this relationship does not always hold (see Yang, 1997a), because the concatenated model sometimes performs better than the separate model, despite the fact that the separate model has a better fit. A possible explanation is that the estimation of branch lengths for each protein using a separate model is difficult, because only a limited number of positions are available. In contrast, this estimation is easier under the concatenated model. As a result, the microsporidial/nucleomorph branch is recognized as being very long, which allows the ML approach with a concatenated model to correct more efficiently for LBA artefacts.
Even the most complex models that we investigated (i.e., those readily available in current software packages) are sensitive to the LBA artefact; therefore the need for developing better tree reconstruction methods, in particular probabilistic ones with improved models of molecular evolution, is obvious. The protocol proposed here (Figs. 2 and 5) can be used as a way of assessment: a new method (model) will perform better if less data from fast-evolving species have to be removed in order to obtain the same BVs in favor of the grouping not affected by LBA. In particular, this benchmark could be used to test the efficiency of recently proposed methods with improved models, which deal with intrasite rate heterogeneity (i.e., heterotachy, Galtier, 2001; Huelsenbeck, 2002; Kolaczkowski and Thornton, 2004) and with the heterogeneity of the substitution process across sites (Lartillot and Philippe, 2004; Pagel and Meade, 2004).
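The assessment criterion described above can be reduced to a single number per method: the smallest percentage of removed proteins at which the expected grouping reaches a chosen BV. The following is a minimal sketch of that idea; the function name, the 95% threshold default, and the two curves are hypothetical and are not data from this study.

```python
def min_removal_for_support(curve, threshold=95.0):
    """Return the smallest removal percentage whose BV reaches the threshold.

    curve -- dict mapping percentage of proteins removed -> bootstrap value (%)
    Returns None if the threshold is never reached.
    """
    for removed in sorted(curve):
        if curve[removed] >= threshold:
            return removed
    return None


# Made-up curves for two hypothetical methods (not data from this study).
method_a = {0: 10, 20: 35, 40: 70, 60: 96, 80: 99, 90: 100}
method_b = {0: 0, 20: 5, 40: 20, 60: 55, 80: 88, 90: 95}

score_a = min_removal_for_support(method_a)
score_b = min_removal_for_support(method_b)
print(score_a, score_b)
```

With these illustrative curves, the method with the lower score would be judged less sensitive to LBA, because it needs less data removal to reach the same support.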
Species Sampling and Sensitivity to LBA Artefacts
In phylogenomic studies, alignments often contain few taxa (Blair et al., 2002; Lerat et al., 2003; Philip et al., 2005; Rokas et al., 2003; Wolf et al., 2004). However, the accuracy of phylogenetic inference based on species-poor data sets is the subject of a long-standing controversy (Graur and Higgins, 1994; Hillis et al., 2003; Philippe and Douzery, 1994; Rosenberg and Kumar, 2003). To study the effect of species sampling, we progressively reduced the number of ingroup as well as outgroup species (see Appendix 3, available at www.systematicbiology.org, for the list of species used), while maintaining the number of positions (24,294) and the method (ML with a separate WAG+F+Γ model) constant.
In the case of the nucleomorph, the sensitivity to LBA generally increases as the number of species decreases (Fig. 6B). However, the performance obtained with 15 species (six Archaea, the nucleomorph, and eight eukaryotic species representing the major lineages; open diamond) is virtually identical to that obtained with 40 species (open square), suggesting that the use of a single representative per major group is sufficient in this case. Nevertheless, the removal of a single additional eukaryotic species (14 species, open triangle) noticeably diminishes the efficiency. More significantly, when only a green plant and a red alga are used as representatives of the slowly evolving eukaryotes (closed triangle), BVs for the expected position of the nucleomorph were always below 64%. With only three archaeal outgroups (closed diamond), BVs were always below 30%, suggesting that the use of six outgroup species improves the inference.
The curves of the BVs for the grouping of nucleomorph with red algae are not perfectly monotonic increasing functions of the percentage of proteins removed (Figs. 5B and 6B). For example, there is a slight decrease of BV when the first 10% of the proteins are removed. Two reasons probably explain the complexity of the curves. First, the RFP method is far from perfect: the fastest evolving proteins are not optimally detected by this method, because the power of the relative rate test is limited (Bromham et al., 2000; Philippe et al., 1994). Second, after the removal of 90% of the proteins, only 1885 amino acid positions remained for the nucleomorph. This low number of positions implies an increasing influence of the sampling error, rendering the curves irregular.
The results for microsporidia are similar (Fig. 6A). With six or nine species, even when 90% of the fast-evolving proteins are removed, the BVs for the grouping of Encephalitozoon with fungi remain below 10%. One of the most efficient tree reconstruction methods used in this study (a separate WAG+F+Γ model) is unable to overcome the LBA artefact if only a few species are considered. Therefore, taxa-poor phylogenomic studies should be regarded with great caution when species evolve at heterogeneous rates, in agreement with earlier studies (Adachi and Hasegawa, 1996a; Philippe and Douzery, 1994). For example, the paraphyly of Ecdysozoa observed in the analyses based on 100 genes/4 species (Blair et al., 2002), 500 genes/6 species (Wolf et al., 2004), and 780 genes/10 species (Philip et al., 2005) is most likely an artefact due to the high evolutionary rate of nematodes. This interpretation is in agreement with a study based on much wider taxon sampling, 146 genes/49 species (Philippe et al., 2005). It should be noted that the species sampling used in this study can be easily improved, in particular by including several microsporidia and nucleomorphs in order to break their long branches. We predict that the quantity of data that has to be removed in order to overcome LBA will diminish accordingly.
However, the effect of taxon sampling is not based solely on the number of species, but also depends on the identity of the species (Lecointre et al., 1993). For example, for the nucleomorph (Fig. 6B), the LBA artefact is less marked when 15 species (open diamond) are used instead of 14 species (open triangle), whereas the contrary is observed for microsporidia (Fig. 6A). The nature of the outgroup can also have a great influence. The LBA is more pronounced in the case of the nucleomorph, when only Pyrococcus (open circle) instead of all six archaeal species (open square) is used as outgroup; this sample with 35 species is even worse than the samples with 14 or 15 species (Fig. 6B). However, in the case of microsporidia (Fig. 6A), the results with one or six Archaea are quite similar, demonstrating that the effect of taxon sampling on phylogenetic inference can be tremendously difficult to predict.
The analyses in which the closest sister-group of the fast-evolving lineages is discarded, corresponding to red algae for the nucleomorph and fungi for microsporidia (indicated by closed squares), are particularly interesting. In theory, the fast species should remain at the same position in the tree: they are expected to be a sister-group of green plants and of animals, respectively. Unfortunately, for the nucleomorph, even with the removal of 90% of the fast-evolving proteins, the BVs for the expected position remain below 5% (Fig. 6B). Contrary to all previous analyses, there are now more than two alternative positions for the nucleomorph, because the sum of the BVs for the expected and the basal positions is sometimes less than 100%. Nevertheless, the support for the nucleomorph as the first-emerging eukaryote is always greater than 85%, indicating that it is not possible to overcome LBA. For microsporidia (Fig. 6A), the situation is less drastic, since the expected position, as a sister-group of animals, is recovered with a BV of 78% if the sequence removal is maximal (90%). This difference between nucleomorph and microsporidia is at first sight surprising, because it represents the only case in which the inference is easier for microsporidia. This is likely due to the fact that the recovery of the monophyly of opisthokonts, in this case microsporidia and animals, is less difficult than that of Plantae, represented by the nucleomorph and green plants. Indeed, in another study using only slowly evolving species (Rodríguez-Ezpeleta et al., 2005), we have shown that it is necessary to use 5000 and 25,000 positions to obtain a BV of 95% for the monophyly of opisthokonts and of Plantae, respectively.
An important conclusion can be drawn from the latter analyses: even when a large number of species and positions and an efficient tree reconstruction method are used, it turns out to be almost impossible to locate the fast-evolving lineages in the absence of closely related species in the data set. This probably explains why we were unable to place kinetoplastids (Leishmania major, Trypanosoma brucei, and T. cruzi) when we applied the RFP method. When the fast proteins are removed, the support for their early emergence decreases more quickly than in the case of the nucleomorph without red alga (from 100% with the complete alignment to 18% with 80% of kinetoplastid proteins removed). However, kinetoplastids do not cluster strongly with any group present in our dataset, the best BV being 34% for their grouping with Plantae (data not shown). Locating the fast-evolving eukaryotic groups such as kinetoplastids, diplomonads, or trichomonads with an archaeal outgroup will thus be a difficult and long-lasting task. The most straightforward approach would be to identify a slowly evolving and closely related group to these taxa. Thus, it is expected that several fast-evolving eukaryotic groups will artefactually remain at the base of the eukaryotic tree with a strong support, when numerous genes are used (Bapteste et al., 2002), until both improved species sampling and methodologies become available.
Phylogenetic Analyses in the Absence of the Distant Outgroup Archaea
To overcome the strong attraction between the distant archaeal outgroup and the fast-evolving ingroup, we have shown the need for good species sampling, an efficient tree reconstruction method, and the removal of an important part of the fastest evolving proteins. As an alternative, the removal of the outgroup could allow the placement of problematic species, even if the question of the location of the root in the tree remains unsolved. The data sets without Archaea were analyzed separately with MrBayes and PHYML for both microsporidia and the nucleomorph. The results are strikingly different: the expected position of the fast-evolving species was recovered by MrBayes in both analyses with and without gamma-distributed rates, whereas either ciliates or alveolates and the fast-evolving species grouped together in the PHYML analyses. To verify that this difference is due to problems of the heuristic search (and not to a difference between ML and Bayesian approaches), various topologies were compared by LRT (Table 2). The expected position of both microsporidia and nucleomorph corresponds to the best ML tree and the LBA tree is always significantly rejected. The heuristic search of PHYML therefore remains trapped in a local optimum, illustrating the difficulty of heuristic searches when large data sets are considered. This argues in favor of our approach that combines topological constraints and an exhaustive search. However, when the closest sister-group of the fast-evolving species is eliminated (either the rhodophyte or fungi), the results of the analyses without outgroup are much less encouraging (Table 2). Nevertheless, our results confirmed the validity of the outgroup removal strategy for studying difficult phylogenetic questions.
However, the removal of the outgroup is not necessarily the panacea: instead of being attracted by the outgroup, the fast-evolving lineage can be attracted by the longest ingroup branch (Philippe et al., 2005). To study this possibility, we have analyzed simultaneously nucleomorph and microsporidia (Table 3). Both fast-evolving species are at the expected position in the ML tree. However, the three alternative LBA artefact-based topologies are only significantly rejected with a Γ model and when a closely related and slowly evolving sister group is present. We have also tested the heuristic search of MrBayes and of PHYML and confirmed that MrBayes always recovered the ML tree and PHYML the LBA tree. Finally, the MP analyses invariably group the two fast-evolving species together with a 100% bootstrap support. They formed a sister-group to ciliates, the fastest of the remaining eukaryotic species. The same highly supported sister-group relationship was also found by MP analyses including only one of the fast species. These analyses confirm the high sensitivity of the MP approach to LBA artefacts.
To gain insights regarding the position of the microsporidium Encephalitozoon within fungi, analyses in the absence of Archaea and more distantly related eukaryotes were carried out. Therefore, fungi and the microsporidium, together with animals, choanoflagellates, and the Conosa as outgroup sequences, were analyzed using the RFP method with a separate WAG+F+Γ model. When 80% of the fastest proteins are removed (Fig. 7), the microsporidium is no longer in a basal position with respect to the fungi, but emerges after the chytridiomycete Neocallimastix, although only weakly supported by a BV of 55%. This analysis suggests that microsporidia emerge within fungi, but our limited sample of chytridiomycetes and glomales and their incompleteness (8309 and 5490 amino acid positions, respectively) reduce the efficiency of our approach. The absence of Entomophthorales and Zoopagales, groups that have been proposed to be closely related to microsporidia (Keeling, 2003), is problematic, but EST sequencing of additional fungi (http://amoebidia.bcm.umontreal.ca/public/pepdb/agrm.php) will soon allow us to address this problem with an adequate species sampling.
Comparison of Simulated and Real Sequences
Our analyses demonstrate that the accuracy of current phylogenetic inference approaches is rather limited vis-à-vis LBA artefacts. However, simulation studies suggest that most methods are rather robust with respect to variable evolutionary rates among lineages (Guindon and Gascuel, 2003; Huelsenbeck, 1998; Kuhner and Felsenstein, 1994; Qiu et al., 2001; Swofford et al., 2001; Wolf et al., 2004). To gain further insights into this conundrum, we performed simulations to mimic the difficult case of microsporidia. Sequences were simulated with a complex model (separate JTT+F+Γ) and trees were inferred by MP and by ML using various models. As shown in Table 4, even without any data removal, all methods, including MP, perform well, except when only three eukaryotic species are used (six and nine species). In these cases, ML requires the use of a Γ model to recover the correct tree with high support. However, even an unrealistic model (Poisson+F instead of JTT+F+Γ) recovers an important signal for the correct position of the fast-evolving species (BV close to 50%) when so few species are used. Table 4 also clearly illustrates that inconsistency of the ML approach is due to model misspecifications, because the correct tree is always recovered when the correct model is used. It should be remembered that, with real data, even with the most complex model and the removal of 90% of the noisiest proteins, the expected position of microsporidia was virtually unsupported when few species are used (BV below 10%, Fig. 6A).
| Conclusion |
|---|
All our analyses demonstrate that tree reconstruction methods are robust to the LBA artefact only when using simulated data. This suggests that simulation studies should be used with great care to evaluate whether a result is due to an LBA artefact. More importantly, experiments based on simulations have led to overconfidence in the accuracy of tree reconstruction methods. We therefore believe that systematic errors, in particular those due to LBA, constitute a problem that should not be neglected in phylogenomic studies (Delsuc et al., 2005). To reduce their impact, we have shown that it is fundamental to (1) use probabilistic methods with complex models, (2) use a rich species sampling (including slowly evolving taxa closely related to the fast-evolving ones), and (3) remove a large proportion of the fast-evolving data.
In fact, a promising avenue in phylogenomics is to take advantage of the large number of positions available through the use of a subset of the data representing the most reliable characters, in order to obtain a phylogeny that minimizes systematic errors while remaining statistically significant. The fact that the RFP method is eliminating entire proteins from fast-evolving lineages (Fig. 2) does not mean that fast-evolving proteins are completely devoid of phylogenetic signal. A positional approach (Brinkmann and Philippe, 1999; Burleigh and Mathews, 2004; Pisani, 2004) could provide a better performance because it would more specifically remove the positions that mainly contain nonphylogenetic signal. We are currently evaluating the performance of these refined methods on the data sets used here.
| Acknowledgements |
|---|
We wish to thank Frédéric Delsuc, Martin Embley, Nicolas Lartillot, Yu Liu, Nicolas Rodrigue, and Naiara Rodríguez-Ezpeleta for helpful comments on an earlier version of the manuscript. Furthermore we are grateful to associate editor Marshal Hedin and the two reviewers, John W. Stiller and Mark Fishbein, for suggestions helping to improve the manuscript. We thank Eric Bapteste for his help in aligning sequences of Neocallimastix and Glomus. HP was supported by the Canada Research Chair Program, the Université de Montréal, and a Bioinformatics Grant of Génome Québec.
| References |
|---|
Adachi J., Hasegawa M. Instability of quartet analyses of molecular sequence data by the maximum likelihood method: The Cetacea/Artiodactyla relationships. Mol. Phylogenet. Evol. (1996a) 6:72–76.
Adachi J., Hasegawa M. MOLPHY version 2.3: Programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. (1996b) 28:1–150.
Aguinaldo A. M., Turbeville J. M., Linford L. S., Rivera M. C., Garey J. R., Raff R. A., Lake J. A. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature (1997) 387:489–493.
Akaike H. Information theory and an extension of the maximum likelihood principle. In: Proceedings 2nd International Symposium on Information Theory—Petrov B. N., Csaki F., eds. (1973) Budapest: Akademia Kiado. Pages 267–281.
Anderson F. E., Swofford D. L. Should we be worried about long-branch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol. Phylogenet. Evol. (2004) 33:440–451.
Baldauf S. L., Roger A. J., Wenk-Siefert I., Doolittle W. F. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science (2000) 290:972–977.
Bapteste E., Brinkmann H., Lee J. A., Moore D. V., Sensen C. W., Gordon P., Durufle L., Gaasterland T., Lopez P., Muller M., Philippe H. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA (2002) 99:1414–1419.
Blair J. E., Ikeo K., Gojobori T., Hedges S. B. The evolutionary position of nematodes? BMC Evol. Biol. (2002) 2:7.
Brinkmann H., Philippe H. Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol. Biol. Evol. (1999) 16:817–825.
Brochier C., Philippe H. Phylogeny: A non-hyperthermophilic ancestor for bacteria? Nature (2002) 417:244.
Bromham L., Penny D., Rambaut A., Hendy M. D. The power of relative rates tests depends on the data. J. Mol. Evol. (2000) 50:296–301.
Bullerwell C. E., Forget L., Lang B. F. Evolution of monoblepharidalean fungi based on complete mitochondrial genome sequences. Nucleic Acids Res. (2003) 31:1614–1623.
Burleigh J. G., Mathews S. Phylogenetic signal in nucleotide data from seed plants: Implications for resolving the seed plant tree of life. Am. J. Bot. (2004) 91:1599–1613.
Burnham K. P., Anderson D. R. Model selection and multimodel inference: A practical information-theoretic approach, 2nd ed. (2003) New York: Springer-Verlag.
Busse I., Preisfeld A. Systematics of primary osmotrophic euglenids: A molecular approach to the phylogeny of Distigma and Astasia (Euglenozoa). Int. J. Syst. Evol. Microbiol. (2003) 53:617–624.
Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. (2000) 17:540–552.
Cavalier-Smith T. Membrane heredity and early chloroplast evolution. Trends Plant Sci. (2000) 5:174–182.
Dacks J. B., Marinets A., Doolittle W., Cavalier-Smith T., Logsdon J. M. Jr. Analyses of RNA Polymerase II genes from free-living protists: Phylogeny, long branch attraction, and the eukaryotic big bang. Mol. Biol. Evol. (2002) 19:830–840.
Dacks J. B., Silberman J. D., Simpson A. G., Moriya S., Kudo T., Ohkuma M., Redfield R. J. Oxymonads are closely related to the excavate taxon Trimastix. Mol. Biol. Evol. (2001) 18:1034–1044.
Delsuc F., Brinkmann H., Philippe H. Phylogenomics and the reconstruction of the Tree of Life: Methods, advances, and challenges. Nat. Rev. Genet. (2005) 6:361–375.
Douglas S., Zauner S., Fraunholz M., Beaton M., Penny S., Deng L. T., Wu X., Reith M., Cavalier-Smith T., Maier U. G. The highly reduced genome of an enslaved algal nucleus. Nature (2001) 410:1091–1096.
Douzery E. J., Snell E. A., Bapteste E., Delsuc F., Philippe H. The timing of eukaryotic evolution: Does a relaxed molecular clock reconcile proteins and fossils? Proc. Natl. Acad. Sci. USA (2004) 101:15386–15391.
Embley T. M., van der Giezen M., Horner D. S., Dyal P. L., Bell S., Foster P. G. Hydrogenosomes, mitochondria and early eukaryotic evolution. IUBMB Life (2003) 55:387–395.
Fast N. M., Kissinger J. C., Roos D. S., Keeling P. J. Nuclear-encoded, plastid-targeted genes suggest a single common origin for apicomplexan and dinoflagellate plastids. Mol. Biol. Evol. (2001) 18:418–426.
Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. (1978) 27:401–410.
Felsenstein J. Inferring phylogenies. (2004) Sunderland, Massachusetts: Sinauer Associates.
Forterre P., Philippe H. Where is the root of the universal tree of life? BioEssays (1999) 21:871–879.
Foster P. G., Hickey D. A. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. (1999) 48:284–290.
Galtier N. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. (2001) 18:866–873.
Gascuel O. BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. (1997) 14:685–695.
Germot A., Philippe H., Le Guyader H. Evidence for loss of mitochondria in Microsporidia from a mitochondrial-type HSP70 in Nosema locustae. Mol. Biochem. Parasitol. (1997) 87:159–168.
Gibbs S. P. The chloroplasts of some algal groups may have evolved from endosymbiotic eukaryotic algae. Ann. NY Acad. Sci. (1981) 361:193–208.
Grassly N. C., Adachi J., Rambaut A. PSeq-Gen: An application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Comput. Appl. Biosci. (1997) 13:559–560.
Graur D., Higgins D. G. Molecular evidence for the inclusion of cetaceans within the order Artiodactyla. Mol. Biol. Evol. (1994) 11:357–364.
Guindon S., Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. (2003) 52:696–704.
Hampl V., Cepicka I., Flegr J., Tachezy J., Kulda J. Critical analysis of the topology and rooting of the parabasalian 16S rRNA tree. Mol. Phylogenet. Evol. (2004) 32:711–723.
Hasegawa M., Fujiwara M. Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor-joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. (1993) 2:1–5.
Hendy M., Penny D. A framework for the quantitative study of evolutionary trees. Syst. Zool. (1989) 38:297–309.
Hillis D. M., Pollock D. D., McGuire J. A., Zwickl D. J. Is sparse taxon sampling a problem for phylogenetic inference? Syst. Biol. (2003) 52:124–126.
Huelsenbeck J. P. Is the Felsenstein zone a fly trap? Syst. Biol. (1997) 46:69–74.
Huelsenbeck J. P. Systematic bias in phylogenetic analysis: Is the Strepsiptera problem solved? Syst. Biol. (1998) 47:519–537.
Huelsenbeck J. P. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. (2002) 19:698–707.
Inagaki Y., Susko E., Fast N. M., Roger A. J. Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1α phylogenies. Mol. Biol. Evol. (2004) 21:1340–1349.
James T. Y., Porter D., Leander C. A., Vilgalys R., Longcore J. E. Molecular phylogenetics of the Chytridiomycota support the utility of ultrastructural data in chytrid systematics. Can. J. Bot. (2000) 78:336–350.
Jones D. T., Taylor W. R., Thornton J. M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. (1992) 8:275–282.
Katinka M. D., Duprat S., Cornillot E., Metenier G., Thomarat F., Prensier G., Barbe V., Peyretaillade E., Brottier P., Wincker P., Delbac F., El Alaoui H., Peyret P., Saurin W., Gouy M., Weissenbach J., Vivares C. P. Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature (2001) 414:450–453.
Keeling P. J. Congruent evidence from alpha-tubulin and beta-tubulin gene phylogenies for a zygomycete origin of microsporidia. Fungal Genet. Biol. (2003) 38:298–309.
Keeling P. J., Fast N. M. Microsporidia: Biology and evolution of highly reduced intracellular parasites. Annu. Rev. Microbiol. (2002) 56:93–116.
Kim J. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. (1996) 45:363–374.
King N., Hittinger C. T., Carroll S. B. Evolution of key cell signaling and adhesion protein families predates animal origins. Science (2003) 301:361–363.
Kishino H., Miyata T., Hasegawa M. Maximum likelihood inference of protein phylogeny, and the origin of chloroplasts. J. Mol. Evol. (1990) 31:151–160.
Kolaczkowski B., Thornton J. W. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature (2004) 431:980–984.
Kuhner M. K., Felsenstein J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. (1994) 11:459–468.
Lake J. A., Rivera M. C. Was the nucleus the first endosymbiont? Proc. Natl. Acad. Sci. USA (1994) 91:2880–2881.
Lang B. F., O'Kelly C., Nerad T., Gray M. W., Burger G. The closest unicellular relatives of animals. Curr. Biol. (2002) 12:1773–1778.
Lartillot N., Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. (2004) 21:1095–1109.
Lecointre G., Philippe H., Le H. L. V., Le Guyader H. Species sampling has a major impact on phylogenetic inference. Mol. Phylogenet. Evol. (1993) 2:205–224.
Lerat E., Daubin V., Moran N. A. From gene trees to organismal phylogeny in prokaryotes: The case of the gamma-Proteobacteria. PLoS Biol. (2003) 1:E19.
Lockhart P. J., Larkum A. W., Steel M., Waddell P. J., Penny D. Evolution of chlorophyll and bacteriochlorophyll: The problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA (1996) 93:1930–1934.
Lopez-Garcia P., Moreira D. Metabolic symbiosis at the origin of eukaryotes. Trends Biochem. Sci. (1999) 24:88–93.
Madsen O., Scally M., Douady C. J., Kao D. J., DeBry R. W., Adkins R., Amrine H. M., Stanhope M. J., de Jong W. W., Springer M. S. Parallel adaptive radiations in two major clades of placental mammals. Nature (2001) 409:610–614.
Martin W., Müller M. The hydrogen hypothesis for the first eukaryote. Nature (1998) 392:37–41.
Murphy W. J., Eizirik E., O'Brien S. J., Madsen O., Scally M., Douady C., Teeling E., Ryder O. A., Stanhope M. J., de Jong W. W., Springer M. S. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science (2001) 294:2348–2351.
Nozaki H., Matsuzaki M., Takahara M., Misumi O., Kuroiwa H., Hasegawa M., Shin I. T., Kohara Y., Ogasawara N., Kuroiwa T. The phylogenetic position of red algae revealed by multiple nuclear genes from mitochondria-containing eukaryotes and an alternative hypothesis on the origin of plastids. J. Mol. Evol. (2003) 56:485–497.
Pagel M., Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. (2004) 53:571–581.
Philip G. K., Creevey C. J., McInerney J. O. The Opisthokonta and the Ecdysozoa may not be clades: Stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol. Biol. Evol. (2005) 22:1175–1184.
Philippe H. Rodent monophyly: Pitfalls of molecular phylogenies. J. Mol. Evol. (1997) 45:712–715.
Philippe H., Douzery E. The pitfalls of molecular phylogeny based on four species, as illustrated by the Cetacea/Artiodactyla relationships. J. Mamm. Evol. (1994) 2:133–152.
Philippe H., Germot A. Phylogeny of eukaryotes based on ribosomal RNA: Long-branch attraction and models of sequence evolution. Mol. Biol. Evol. (2000) 17:830–834.
Philippe H., Germot A., Moreira D. The new phylogeny of eukaryotes. Curr. Opin. Genet. Dev. (2000a) 10:596–601.
Philippe H., Lartillot N., Brinkmann H. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa and Protostomia. Mol. Biol. Evol. (2005) 22:1246–1253.
Philippe H., Laurent J. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. (1998) 8:616–623.
Philippe H., Lopez P., Brinkmann H., Budin K., Germot A., Laurent J., Moreira D., Müller M., Le Guyader H. Early branching or fast evolving eukaryotes? An answer based on slowly evolving positions. Proc. R. Soc. Lond. B Biol. Sci. (2000b) 267:1213–1221.
Philippe H., Snell E. A., Bapteste E., Lopez P., Holland P. W., Casane D. Phylogenomics of eukaryotes: Impact of missing data on large alignments. Mol. Biol. Evol. (2004) 21:1740–1752.
Philippe H., Sörhannus U., Baroin A., Perasso R., Gasse F., Adoutte A. Comparison of molecular and paleontological data in diatoms suggests a major gap in the fossil record. J. Evol. Biol. (1994) 7:247–265.
Phillips M. J., Delsuc F., Penny D. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. (2004) 21:1455–1458.
Pisani D. Identifying and removing fast-evolving sites using compatibility analysis: An example from the arthropoda. Syst. Biol. (2004) 53:978–989.
Poe S. Evaluation of the strategy of long-branch subdivision to improve the accuracy of phylogenetic methods. Syst. Biol. (2003) 52:423–428.
Poole A., Jeffares D., Penny D. Early evolution: Prokaryotes, the new kids on the block. Bioessays (1999) 21:880–889.
Qiu Y. L., Lee J., Bernasconi-Quadroni F., Soltis D. E., Soltis P. S., Zanis M., Zimmer E. A., Chen Z., Savolainen V., Chase M. W. The earliest angiosperms: Evidence from mitochondrial, plastid and nuclear genomes. Nature (1999) 402:404–407.
Qiu Y. L., Lee J., Whitlock B. A., Bernasconi-Quadroni F., Dombrovska O. Was the ANITA rooting of the angiosperm phylogeny affected by long-branch attraction? Amborella, Nymphaeales, Illiciales, Trimeniaceae, and Austrobaileya. Mol. Biol. Evol. (2001) 18:1745–1753.
Rodríguez-Ezpeleta N., Brinkmann H., Burey S. C., Roure B., Burger G., Löffelhardt W., Bohnert H. J., Philippe H., Lang B. F. Monophyly of primary photosynthetic eukaryotes: Green plants, red algae and glaucophytes. Curr. Biol. (2005) 15:1325–1330.
Rokas A., Williams B. L., King N., Carroll S. B. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature (2003) 425:798–804.
Ronquist F., Huelsenbeck J. P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.
Rosenberg M. S., Kumar S. Taxon sampling, bioinformatics, and phylogenomics. Syst. Biol. (2003) 52:119–124.
Salter L. A. Complexity of the likelihood surface for a large DNA dataset. Syst. Biol. (2001) 50:970–978.
Sanderson M. J., Wojciechowski M. F., Hu J., Khan T. S., Brady S. G. Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants. Mol. Biol. Evol. (2000) 17:782–797.
Schmidt H. A., Strimmer K., Vingron M., von Haeseler A. TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics (2002) 18:502–504.
Shimodaira H., Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. (1999) 16:1114–1116.
Shimodaira H., Hasegawa M. CONSEL: For assessing the confidence of phylogenetic tree selection. Bioinformatics (2001) 17:1246–1247.
Simpson A. G., Roger A. J., Silberman J. D., Leipe D. D., Edgcomb V. P., Jermiin L. S., Patterson D. J., Sogin M. L. Evolutionary history of "early-diverging" eukaryotes: The excavate taxon Carpediemonas is a close relative of Giardia. Mol. Biol. Evol. (2002) 19:1782–1791.
Soltis P. S., Soltis D. E., Chase M. W. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature (1999) 402:402–404.
Stechmann A., Cavalier-Smith T. Rooting the eukaryote tree by using a derived gene fusion. Science (2002) 297:89–91.
Stiller J., Hall B. Long-branch attraction and the rDNA model of early eukaryotic evolution. Mol. Biol. Evol. (1999) 16:1270–1279.
Sullivan J., Swofford D. L. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. (2001) 50:723–729.
Swofford D. L. PAUP*: Phylogenetic analysis using parsimony and other methods, version 4b10. (2000) Sunderland, Massachusetts: Sinauer Associates.
Swofford D. L., Waddell P. J., Huelsenbeck J. P., Foster P. G., Lewis P. O., Rogers J. S. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. (2001) 50:525–539.
Teunissen M. J., Op den Camp H. J. Anaerobic fungi and their cellulolytic and xylanolytic enzymes. Antonie Van Leeuwenhoek (1993) 63:63–76.
Whelan S., Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. (2001) 18:691–699.
Wiens J. J. Does adding characters with missing data increase or decrease phylogenetic accuracy? Syst. Biol. (1998) 47:625–640.
Wiens J. J. Missing data, incomplete taxa, and phylogenetic accuracy. Syst. Biol. (2003) 52:528–538.
Wiens J. J. Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Syst. Biol. (2005) 54:731–742.
Wolf Y. I., Rogozin I. B., Koonin E. V. Coelomata and not Ecdysozoa: Evidence from genome-wide phylogenetic analysis. Genome Res. (2004) 14:29–36.
Xue G. P., Orpin C. G., Gobius K. S., Aylward J. H., Simpson G. D. Cloning and expression of multiple cellulase cDNAs from the anaerobic rumen fungus Neocallimastix patriciarum in Escherichia coli. J. Gen. Microbiol. (1992) 138:1413–1420.
Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.
Yang Z. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. (1996) 42:587–596.
Yang Z. How often do wrong models produce better phylogenies? Mol. Biol. Evol. (1997a) 14:105–108.
Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. (1997b) 13:555–556.
Yoon H. S., Hackett J. D., Pinto G., Bhattacharya D. The single, ancient origin of chromist plastids. Proc. Natl. Acad. Sci. USA (2002) 99:15507–15512.