
Systematic Biology 2005 54(5):743-757; doi:10.1080/10635150500234609

© 2005 Society of Systematic Biologists

An Empirical Assessment of Long-Branch Attraction Artefacts in Deep Eukaryotic Phylogenomics

Edited by Marshal Hedin

Henner Brinkmann1, Mark van der Giezen2, Yan Zhou1, Gaëtan Poncelin de Raucourt1 and Hervé Philippe1

1 Canadian Institute for Advanced Research, Centre Robert Cedergren, Département de Biochimie, Université de Montréal, Succursale Centre-Ville Montréal, Québec H3C3J7, Canada; E-mail: herve.philippe{at}umontreal.ca (H.P.)
2 School of Biological and Chemical Sciences, Queen Mary, University of London Mile End Road, E1 4NS, London, UK


    Abstract
In the context of exponentially growing molecular databases, it becomes increasingly easy to assemble large multigene data sets for phylogenomic studies. The expected increase of resolution due to the reduction of the sampling (stochastic) error is becoming a reality. However, the impact of systematic biases will also become more apparent or even dominant. We have chosen to study the case of the long-branch attraction artefact (LBA) using real instead of simulated sequences. Two fast-evolving eukaryotic lineages, whose evolutionary positions are well established, microsporidia and the nucleomorph of cryptophytes, were chosen as model species. A large data set was assembled (44 species, 133 genes, and 24,294 amino acid positions) and the resulting rooted eukaryotic phylogeny (using a distant archaeal outgroup) is positively misled by an LBA artefact despite the use of a maximum likelihood–based tree reconstruction method with a complex model of sequence evolution. When the fastest evolving proteins from the fast lineages are progressively removed (up to 90%), the bootstrap support for the apparently artefactual basal placement decreases to virtually 0%, and conversely only the expected placement, among all the possible locations of the fast-evolving species, receives increasing support that eventually converges to 100%. The percentage of removal of the fastest evolving proteins constitutes a reliable estimate of the sensitivity of phylogenetic inference to LBA. This protocol confirms that both a rich species sampling (especially the presence of a species that is closely related to the fast-evolving lineage) and a probabilistic method with a complex model are important to overcome the LBA artefact. Finally, we observed that phylogenetic inference methods perform strikingly better with simulated as opposed to real data, and suggest that testing the reliability of phylogenetic inference methods with simulated data leads to overconfidence in their performance.
Although phylogenomic studies can be affected by systematic biases, the possibility of discarding a large amount of data containing most of the nonphylogenetic signal allows recovering a phylogeny that is less affected by systematic biases, while maintaining a high statistical support.

Keywords: Distant outgroup; eukaryotic tree; long-branch attraction; microsporidia; multigene data sets; nucleomorph; rooting; species sampling; systematic biases

Received November 5, 2004; Revised February 7, 2005; Accepted April 30, 2005


Single-gene phylogenies are generally poorly resolved because the number of informative positions is limited and stochastic (random) noise yields contradictory, yet often poorly supported, results. Phylogenomics, that is the use of a large number of genes, or ultimately of complete genomes, in phylogenetic inference, is of great promise to overcome stochastic errors and to furnish statistically significant results. Recently, the analysis of several large data sets has allowed enhanced insight into long-term outstanding questions such as relationships of placental mammals (Madsen et al., 2001; Murphy et al., 2001) and angiosperms (Qiu et al., 1999; Soltis et al., 1999). However, conflicting results have also emerged. For example, the monophyly of Ecdysozoa (nematodes + arthropods) is strongly rejected by some phylogenomic analyses (Blair et al., 2002; Philip et al., 2005; Wolf et al., 2004) and strongly supported by others (Delsuc et al., 2005; Philippe et al., 2005).

The use of large data sets reduces the impact of the stochastic error (which will disappear only with infinite samples); however, it can exacerbate systematic errors, which can eventually become dominant. Systematic errors occur when the real evolutionary process differs from our oversimplified models (Phillips et al., 2004). They may also be found in the case of single genes, but are usually hidden by sampling errors. Although probabilistic methods like maximum likelihood (ML) or Bayesian approaches are known to be more robust to model violations (Hasegawa and Fujiwara, 1993; Sullivan and Swofford, 2001), heterotachy (defined as the heterogeneity of the evolutionary rate of a given position throughout time) and compositional bias can lead to inconsistency (Foster and Hickey, 1999; Inagaki et al., 2004; Kolaczkowski and Thornton, 2004; Lockhart et al., 1996; Philippe and Germot, 2000). For example, the minimum evolution method is inconsistent in the case of a large yeast data set of Rokas et al. (2003) because two unrelated species share a similar nucleotide composition. This can be corrected, however, by RY coding (Phillips et al., 2004).

Variable evolutionary rates among lineages constitute an important source of systematic bias. The long-branch attraction (LBA) artefact posits that the two longest branches will cluster together under certain conditions, irrespective of the true relationships of the sequences under study (Felsenstein, 1978). In the case of a distant outgroup (representing a long branch), LBA leads to the artefactual early emergence of the fast-evolving lineages of the ingroup (Philippe and Laurent, 1998). Although LBA artefacts were suspected to be present in various phylogenies (Bapteste et al., 2002; Dacks et al., 2002; Huelsenbeck, 1997; Nozaki et al., 2003; Qiu et al., 2001; Sanderson et al., 2000; Simpson et al., 2002; Stiller and Hall, 1999), they are difficult to discover and overcome (see the case of glires, Douzery et al., 2004). The most obvious way would be the use of a tree reconstruction method that is not sensitive to this artefact, but, unfortunately such a method does not yet exist. Probabilistic methods fail because the current models (even the most complex ones) do not reflect all facets of biological reality, not because of the method per se (Felsenstein, 2004; Lockhart et al., 1996). Simulation studies (Guindon and Gascuel, 2003; Huelsenbeck, 1998; Kuhner and Felsenstein, 1994; Qiu et al., 2001; Swofford et al., 2001; Wolf et al., 2004) have revealed that maximum parsimony (MP) is generally more sensitive than distance-based methods, whereas probabilistic methods are generally more robust. The different sensitivity of MP and probabilistic methods can help to detect if the LBA artefact is playing a major role (Germot et al., 1997; Huelsenbeck, 1998).

However, if all methods yield trees where long branches, such as fast-evolving species and outgroup, are clustered, the situation becomes much more complex. One possibility is to modify the taxonomic sampling so that only the slowest evolving species are included (Aguinaldo et al., 1997). Alternatively, the addition of species can alleviate the LBA artefact by dividing long internal branches (Hendy and Penny, 1989). In this case, the addition of slowly evolving species is much more efficient, whereas the addition of fast-evolving species makes things worse (Kim, 1996; Poe, 2003). Although the most efficient conditions of species addition are not known (Hillis et al., 2003; Rosenberg and Kumar, 2003), several cases of LBA were revealed by adding species (Anderson and Swofford, 2004; Dacks et al., 2001; Inagaki et al., 2004; Philippe, 1997).

Finally, when the species sampling is reasonable for a given phylogenetic problem, the removal of sequence positions can be an effective method. The fast-evolving positions, which are saturated by multiple substitutions, have lost much, if not all, of their phylogenetic signal and are especially sensitive to any systematic bias. The slow/fast (SF) method (Brinkmann and Philippe, 1999), which starts by selecting the slowest evolving positions, and then progressively adding faster evolving positions, can reveal a transition between a topology in which the long branches are not grouped and a topology dictated by the LBA artefact (Brinkmann and Philippe, 1999; Brochier and Philippe, 2002; Busse and Preisfeld, 2003; Delsuc et al., 2005; Hampl et al., 2004; Philippe et al., 2000b).

Rooting deep level phylogenies is of fundamental importance in understanding the origin of numerous groups, eukaryotes in particular (Forterre and Philippe, 1999; Lake and Rivera, 1994; Lopez-Garcia and Moreira, 1999; Martin and Müller, 1998; Poole et al., 1999). Because many groups only have a distantly related outgroup (e.g., marsupials versus placental mammals, gnetales/ gymnosperms versus angiosperms, Archaea versus eukaryotes), the probability of the erroneous early emergence of fast-evolving lineages is high when multiple genes are used. One can therefore legitimately ask the question: is it possible to confidently root deep level trees in a phylogenomic analysis, or in other words, to eschew the LBA artefact in the presence of a distant outgroup?

In this article, we tackle this question by studying a situation in which the phylogenetic position of two fast-evolving lineages is well-established a priori. We selected the eukaryotic phylogeny (Fig. 1) because the archaeal sequences represent a distantly related outgroup that should strongly attract any fast evolving eukaryotes. Two fast-evolving eukaryotes, the nucleomorph of the cryptophyte Guillardia theta, and the microsporidium Encephalitozoon cuniculi, were selected because their complete genomes had been sequenced (Douglas et al., 2001; Katinka et al., 2001). The nucleomorph originated in a secondary endosymbiotic event in which an entire red alga was engulfed by a flagellate host cell, and corresponds therefore to the remnant of the former red algal nucleus, which is now highly reduced. This interpretation is supported by phylogenetic data from the corresponding chloroplast genome (Douglas et al., 2001; Yoon et al., 2002) and by morphological characters (Gibbs, 1981). The position of microsporidia has been more controversial, but now a large body of evidence argues that microsporidia are closely related to fungi (Keeling and Fast, 2002), although their exact position within fungi remains uncertain (Keeling, 2003). To include the chytridiomycetes, an important group of fungi, we sequenced ~ 1000 ESTs from Neocallimastix patriciarum. N. patriciarum is an anaerobic fungus that can be found in the digestive tract of herbivorous mammals, in both ruminants and nonruminants (Teunissen and Op den Camp, 1993). Interestingly, this organism does not possess classical aerobic mitochondria, but rather hydrogen-producing organelles called hydrogenosomes. Hydrogenosomes are modified mitochondria that completely lost their genome and respiratory functions (reviewed in Embley et al., 2003). The group chytridiomycetes, to which this organism belongs, is characterized by the presence of a flagellum, a unique property within fungi. 
For this reason, it is generally assumed that chytridiomycetes have a basal position within fungi (James et al., 2000). We assembled a large data set of 133 nuclear encoded genes from six archaeal outgroup and 33 slow- and 2 fast-evolving eukaryotic ingroup species, including the microsporidium Encephalitozoon cuniculi and the nucleomorph of the cryptophyte Guillardia theta. Because the two fast-evolving species were misplaced in preliminary analyses, four different approaches were used to study LBA artefacts: (1) the removal of the fastest evolving proteins, (2) the use of various tree reconstruction methods, (3) the use of diverse taxon samplings, and (4) phylogenetic inference without the distant outgroup.


Figure 1 Eukaryotic tree, rooted according to Philippe et al. (2000b) and Stechmann and Cavalier-Smith (2002), showing the expected position of the two fast-evolving eukaryotic species, the microsporidium Encephalitozoon and the nucleomorph of the cryptophyte alga Guillardia, in the presence of the distantly related outgroup Archaea. The topology of the tree is a consensus emerging from several multigene analyses (Baldauf et al., 2000; Lang et al., 2002; Philippe et al., 2004). This tree illustrates our working hypothesis, which we will test, and the high evolutionary rate of nucleomorph and microsporidia. The branch lengths were inferred by Tree-Puzzle (WAG+F+Γ4) based on the complete data set with 41 species and 24,294 amino acid positions. The scale bar corresponds to 0.1 amino acid substitutions per site.

 

    Materials and Methods
Neocallimastix ESTs
Sequences were obtained from a previously constructed Neocallimastix patriciarum ZAP II cDNA library (Xue et al., 1992). An aliquot of this library containing a random collection of clones was excised by superinfection with helper phages according to the manufacturer's instructions (Stratagene). One thousand clones were randomly selected and subsequently analyzed by sequencing. A detailed description of the sequences will be provided elsewhere.

Assembling the Alignment
We added to the aligned data sets of 174 proteins used by Philippe et al. (2004) the amino acid sequences available in GenBank (nonredundant section) in December 2003, using a BLASTP search with a cutoff e-value corresponding to the highest value of the orthologous proteins in Archaea. We then added to the alignments the EST sequences from the chytridiomycete N. patriciarum, and EST, as well as genomic sequences, from several ongoing sequencing projects. We retrieved most of the sequences from GenBank through NCBI (http://www.ncbi.nlm.nih.gov) except for Cryptococcus neoformans (C. neoformans cDNA Sequencing Project at http://www.genome.ou.edu/cneo.html; and C. neoformans Genome Project, Stanford Genome Technology Center and the Institute for Genomic Research, at http://baggage.stanford.edu/group/C.neoformans/download.html), Dictyostelium discoideum (Genome Sequencing Center Jena website at http://genome.ibm-jena.de/dictyostelium), Thalassiosira pseudonana (http://genome.jgi-psf.org/thaps1/thaps1.download.ftp.html), Phytophthora sojae (http://genome.jgi-psf.org/sojae1/sojae1.download.ftp.html), Tetrahymena thermophila (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/t_thermophila/), and Monosiga brevicollis (http://projects.bocklabs.wisc.edu/carroll/choano/, King et al., 2003).

The sequences were added as described in Philippe et al. (2004). To deal with the problem of nonorthologous sequences, we constructed amino acid-based phylogenies (MP and ML) starting with the original 174 proteins, of which the 133 proteins used to assemble our final phylogenomic data set represent a conservative subsample. At this step we also eliminated all proteins that had either too few species or too much missing data. The reliability of orthology assignment was greatly improved due to the use of numerous species. Genes for which orthology relationships were difficult to establish (e.g., EF-1α or cytosolic HSP70) were completely discarded from the analyses. When recent gene duplications were detected (almost exclusively for vertebrates), the slowest evolving gene copy was selected. We did not find in our individual gene data sets any case in which horizontal gene transfers would provide a reasonable explanation.

To assemble a data set rich in both species and genes, sequences can be missing or partial for some proteins from some species, because we compiled the sequences mainly from cDNA sequencing projects. To decrease the amount of missing data, we created chimerical sequences between closely related taxa (see Appendix 1, available at www.systematicbiology.org). We retained only species for which a sufficiently large number of amino acid residues were available (larger than 5000). Simulation studies have shown that under these conditions the impact of missing data is negligible (Philippe et al., 2004; Wiens, 2003). Moreover, the removal of the most incomplete taxa has no visible effect on the phylogenetic inference (Philippe et al., 2005).

In order to extract only unambiguously aligned portions and to eliminate divergent regions of the alignment, we used Gblocks (Castresana, 2000) with the following parameter settings: a minimum of 50% of the sequences per position identical for a conserved position, a minimum of 75% of the sequences identical for a flanking position, a maximum of five contiguous nonconserved positions, and a minimum of five positions for a block. This selection was manually verified; in particular, a few conserved regions with some amount of missing data, for which Gblocks was too stringent, were reintroduced into the dataset. A data set comprising 44 species (six Archaea, 33 slowly evolving eukaryotes, a microsporidium, a nucleomorph, and three kinetoplastids) and 133 genes (displaying a mean of ~ 24% of missing data per species, Appendix 2, available at www.systematicbiology.org) was constructed. In a few cases the amount of missing data is quite high, with a maximum of 80% for the brown alga Laminaria. However, there are only seven species with less than 10,000 amino acid positions, and they are always closely related to almost complete species, so that no major eukaryotic lineages are only represented by highly incomplete taxa. The alignments are available upon request and nexus files of the two basic data sets (including two trees each; expected and LBA) were also submitted to TREEBASE under the study accession number SN2312.

Phylogenetic Analyses
Phylogenetic analyses were performed at the amino acid level. Various models of sequence evolution were considered. We used Poisson (the same probability for all pairs), WAG (Whelan and Goldman, 2001), or JTT (Jones et al., 1992) amino acid replacement matrices with and without gamma-distributed rates across sites (Yang, 1993). Two different models were applied: (1) the separate model (Yang, 1996) where branch lengths and the α parameter are free to vary for all genes, and (2) the concatenated model that considers all genes as a "super-gene."

Two important limitations for finding the best tree become prominent when a large number of positions and a large number of species are used: (1) pronounced local minima and (2) computing time and memory requirements. The height of the potential barriers separating local minima increases with the number of positions used (Salter, 2001). The probability that the heuristic search is trapped in a local minimum is therefore much higher. As a consequence, we used mainly exhaustive tree searches for the ML analyses. Since the number of possible topologies is too large for an exhaustive search (10^53 for 39 species), we proceeded in two steps.
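The size of this search space follows from the standard count of unrooted, fully bifurcating trees, (2n - 5)!! for n taxa. The short sketch below (the function name is ours, introduced only for illustration) reproduces the order of magnitude for 39 species.

```python
import math


def num_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted, fully bifurcating trees: (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3 * 5 * ... * (2n - 5).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


n = 39  # six Archaea plus 33 slowly evolving eukaryotes
exponent = math.floor(math.log10(num_unrooted_topologies(n)))
print(f"{n} taxa: on the order of 10^{exponent} unrooted topologies")
```

Even at one topology per nanosecond, enumerating this many trees is out of reach, which is why the search has to be restricted by constraints.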

First, for the data set comprising the 33 slowly evolving eukaryotic species and the six Archaea, several heuristic searches were performed. The methods used were MP implemented in PAUP* (Swofford, 2000), ML using PHYML (Guindon and Gascuel, 2003) with a concatenated JTT+F+Γ4 model, and Bayesian inference in MrBayes (Ronquist and Huelsenbeck, 2003) with a concatenated WAG+F+Γ4 model (150,000 generations, burn in of 14,500 generations, 4 chains). The parameter F (frequency) corresponds to the use, as equilibrium frequencies, of the amino acid frequencies observed in the data sets under study, instead of the ones obtained for the original data set used to infer the amino acid replacement matrix (WAG or JTT). The high memory requirements of the probabilistic analyses based on the concatenated data sets limited the modeling of among site rate variation to the use of four discrete gamma categories (Γ4). Distance methods were not used to infer trees, because they are sensitive to the presence of missing data in the alignment. All MP analyses were performed without constrained trees and applied the following options: heuristic search with TBR, 10 random species additions, and 1000 bootstrap replicates. All Bayesian inferences were performed three times independently and always converged towards the same posterior distributions. In the PHYML analyses, the starting tree was obtained using ML-based distance estimates and the algorithm BIONJ (Gascuel, 1997); the ML tree was subsequently obtained by nearest neighbor interchange (NNI). Given the high number of positions, most of the nodes were, as expected, highly supported by all methods and were thus constrained in the subsequent analyses. Only the relationships among the six main eukaryotic lineages and among the four main fungal lineages were left unconstrained (Appendix 4, available at www.systematicbiology.org).
These constraints define 14,175 topologies, which were analyzed with a concatenated JTT+F model by PROTML (Adachi and Hasegawa, 1996b). We then retained the 1000 best topologies for further analyses, as in Bapteste et al. (2002) and Philippe et al. (2004). These topologies were analyzed with a separate WAG+F+Γ model with the program Tree-Puzzle (Schmidt et al., 2002).

Second, we tried to locate the three fast-evolving lineages one at a time, namely the microsporidium Encephalitozoon, the nucleomorph of the cryptophyte Guillardia theta, and three kinetoplastids (Leishmania major, Trypanosoma brucei, and T. cruzi). Their possible locations in the phylogeny were analyzed exhaustively by adding them to all 75 branches of the 39 species tree (six Archaea and the 33 slowly evolving eukaryotes). However, because the topology of this tree is not known with certainty, we retained the 25 best topologies obtained with a separate WAG+F+Γ model. At first sight, 25 topologies may seem to be a small number compared to the 10^53 possible topologies. However, the two best topologies received together 99% of the RELL bootstrap support and the 26th topology is less likely than the best one by ten orders of magnitude (Δ lnL = 221). A total of 1875 different topologies (25 x 75) was thus analyzed to locate each fast evolving lineage.
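The counts in this paragraph follow from the fact that an unrooted bifurcating tree with n leaves has 2n - 3 branches, each of which is a candidate attachment point for one additional lineage. A minimal arithmetic check (variable and function names are illustrative only):

```python
def num_branches_unrooted(n_taxa: int) -> int:
    """Branches (internal + terminal) of an unrooted bifurcating tree."""
    return 2 * n_taxa - 3


n_backbone_taxa = 39       # six Archaea + 33 slowly evolving eukaryotes
n_backbone_topologies = 25  # best backbone topologies retained

placements = num_branches_unrooted(n_backbone_taxa)
print(placements)                          # 75
print(n_backbone_topologies * placements)  # 1875
```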

Because the computation of bootstrap values is the most demanding task, we used the RELL method (Kishino et al., 1990). More precisely, the likelihood values of each tree for each gene and the corresponding branch lengths were computed using Tree-Puzzle. The likelihood of each position for each tree was then computed using CodeML of the PAML package (Yang, 1997b). The site-wise likelihood values were used by a home-made program to compute the RELL bootstrap values of each topology based on 1000 replicates. The bootstrap values (BVs) for the placement of the fast-evolving lineages should not be underestimated by the RELL procedure, since, despite the fact that we analyzed only 1875 (25 x 75) topologies, all possible positions of the fast lineages in the tree were studied. This approach allowed us to perform all computations in a reasonable time (about 3 months on a cluster with 30 Xeon 2.8 GHz processors).
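For readers unfamiliar with RELL (resampling of estimated log-likelihoods), the idea can be sketched as follows: sites are resampled with replacement, the precomputed site-wise log-likelihoods are summed for each topology, and the support of a topology is the fraction of replicates in which it has the highest total. This is a generic illustration, not the home-made program used here; the function name, the toy values, and the random seed are assumptions.

```python
import random


def rell_bootstrap(site_lnl, n_replicates=1000, seed=0):
    """RELL support for each topology.

    site_lnl[t][i] is the log-likelihood of site i under topology t,
    computed once with parameters fixed at their ML estimates.
    """
    n_trees = len(site_lnl)
    n_sites = len(site_lnl[0])
    rng = random.Random(seed)
    wins = [0] * n_trees
    for _ in range(n_replicates):
        # Resample site indices with replacement (one bootstrap replicate).
        sample = rng.choices(range(n_sites), k=n_sites)
        totals = [sum(row[i] for i in sample) for row in site_lnl]
        best = max(range(n_trees), key=totals.__getitem__)
        wins[best] += 1
    return [w / n_replicates for w in wins]


# Toy example with hypothetical site-wise log-likelihoods for 3 topologies.
toy = [
    [-2.1, -1.9, -3.0, -2.5, -1.2, -2.2],
    [-2.0, -2.3, -2.8, -2.6, -1.5, -2.4],
    [-2.6, -2.4, -3.3, -2.9, -1.9, -2.7],
]
print(rell_bootstrap(toy))
```

Because no tree is re-optimized per replicate, the cost is dominated by the initial site-wise likelihood computation, which is what makes the approach tractable for thousands of topologies.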

The fit of models to data was evaluated using the Akaike Information Criterion (AIC) (Akaike, 1973). According to Burnham and Anderson (2003), a delta AIC value greater than 10 means that the competing model receives no support. Tree comparisons were performed using the approximate unbiased (AU) and the Shimodaira-Hasegawa (SH) (Shimodaira and Hasegawa, 1999) tests as implemented in the program CONSEL (Shimodaira and Hasegawa, 2001).
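As a brief illustration of this criterion, AIC = 2k - 2 lnL, where k is the number of free parameters and lnL the maximized log-likelihood; models are then compared through their difference to the best (lowest) AIC. The log-likelihoods and parameter counts below are hypothetical placeholders, not values from this study.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(models):
    """Map model name -> AIC difference to the best-fitting model."""
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}


# Hypothetical (lnL, number of free parameters) pairs for two models.
models = {"concatenated": (-1000.0, 20), "separate": (-950.0, 45)}
for name, delta in delta_aic(models).items():
    verdict = "no support" if delta > 10 else "not excluded"
    print(f"{name}: delta AIC = {delta:.1f} ({verdict})")
```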

Removal of Fast-Evolving Proteins
To test whether LBA affects phylogenetic inference, we devised a method coined Removal of Fast-evolving Proteins (RFP). The fastest evolving proteins were detected and selectively eliminated in a protein specific way (Fig. 2). The distances were estimated by ML using the program Tree-Puzzle with the same model as for ML tree inference. They were calculated for the concatenation of all proteins as well as separately for each protein. The mean distances between the Archaea and the fast evolving eukaryotic lineages under study (like Encephalitozoon) were then calculated for both the concatenation and each of the proteins. Thereafter, the genes were sorted according to the quotient obtained by the following formula: $d_{\mathrm{mean,gene}}(\mathrm{Fast},\mathrm{Archaea}) \,/\, d_{\mathrm{mean,concat}}(\mathrm{Fast},\mathrm{Archaea})$. The greater the value, the faster the evolutionary rate for this protein in comparison to the mean value obtained for all concatenated proteins. As shown in Figure 2, the fastest evolving proteins from the fast-evolving lineage were selectively eliminated for a given protein and replaced by question marks, the sequences of all other species remaining unchanged. This selective elimination of proteins was performed by steps of 10%, up to 90%.
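The ranking and masking steps of RFP can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this study: the data structures, the function name `rfp_mask`, the toy sequences and distances, and the rounding of the number of proteins to remove are all hypothetical.

```python
def rfp_mask(alignments, gene_dist, concat_dist, fast_taxon, percent):
    """Mask the fastest `percent`% of proteins for one fast-evolving taxon.

    alignments: {gene: {taxon: aligned sequence}}
    gene_dist:  {gene: mean ML distance between fast taxon and Archaea}
    concat_dist: the same mean distance for the concatenated data set
    """
    # Relative rate of each protein: per-gene distance / concatenated distance.
    ratios = {gene: d / concat_dist for gene, d in gene_dist.items()}
    ranked = sorted(ratios, key=ratios.get, reverse=True)  # fastest first
    n_remove = round(len(ranked) * percent / 100)  # rounding is an assumption
    removed = ranked[:n_remove]
    masked = {}
    for gene, seqs in alignments.items():
        seqs = dict(seqs)  # copy so the input is left untouched
        if gene in removed:
            # Only the fast taxon is replaced by question marks.
            seqs[fast_taxon] = "?" * len(seqs[fast_taxon])
        masked[gene] = seqs
    return masked, removed


# Toy data set: 4 genes, hypothetical sequences and distances.
alignments = {
    g: {"Fast": "MKVL", "Archaeon": "MKIL", "Slow": "MKVL"}
    for g in ("geneA", "geneB", "geneC", "geneD")
}
gene_dist = {"geneA": 2.0, "geneB": 1.0, "geneC": 4.0, "geneD": 0.5}
masked, removed = rfp_mask(alignments, gene_dist, 1.0, "Fast", 50)
print(removed)
```

Running the function for `percent` in steps of 10 up to 90 yields the nine reduced data sets described in the text, each of which is then analyzed alongside the complete one.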


Figure 2 Schematic illustration of the RFP method. The mean distances per protein between Archaea and the nucleomorph (in this example) are divided by the mean distances obtained for the complete data set. The quotient obtained is used to sort the proteins as a function of their relative evolutionary rate; for values > 1 the nucleomorph sequence (gene) is evolving faster than the mean of the total data set. Subsequently, each time 10% of the fastest evolving proteins are removed from the analysis, up to a maximum of 90%. The removed proteins of the nucleomorph are replaced by question marks and the rest of the data set remains unchanged. The complete as well as the nine new data sets with a reduced number of nucleomorph proteins are then analyzed and bootstrap values are computed.

 
The RFP method does not assume an a priori knowledge of the "correct" phylogeny and is therefore topology independent. We remove up to 90% of the fastest evolving proteins (a limit that allows conserving sufficient phylogenetic information). The topology may change as a function of protein elimination or remain the same. We chose cases in which we expect that a certain change will eventually occur; however, this is mainly a control. The only a priori knowledge required by the RFP method is the nature of the outgroup. Here, Archaea are a fairly undisputed outgroup of eukaryotes.

Simulation Studies
We generated 100 matrices of 40 taxa and 24,294 amino acid positions under PSeq-Gen (Grassly et al., 1997) using the model topology shown in Figure 1, except that the nucleomorph was not considered. A separate model was used for simulations. More precisely, empirical amino acid frequencies, alpha parameter, and branch lengths were estimated for each protein separately. Then, for each protein, sequences of the size of this protein were simulated using the protein-specific parameters. The phylogenies were then inferred using the same protocol as for real data. With MP, heuristic search with 10 random species additions and TBR swapping was performed. With ML, all positions of the fast-evolving lineage were considered, but only the 10 best topologies connecting the 33 slow-evolving species, instead of 25, were retained, for computing time reasons. Simulation studies were also performed using a concatenated model, and the results were virtually identical to the separate model (data not shown). It should be noted that for the species rich data sets (32 taxa or more), only 10 replicates were analyzed with ML because of computing time limitations. However, because we obtained 100% for all 10 replicates, it is unlikely that the analysis of more replicates will fundamentally change the results.


    Results and Discussion
Removal of the Fastest Evolving Microsporidial and Nucleomorph Proteins
To simplify the study, the two fast-evolving species were analyzed separately. Beginning with microsporidia, an ML tree based on 133 genes (24,294 positions) inferred using either a separate WAG+F+Γ model or a concatenated JTT+F+Γ model is shown in Figure 3. The tree is in excellent agreement with previous studies of eukaryotic phylogeny (Baldauf et al., 2000; Philippe et al., 2004). In particular, the monophyly of all major phyla (for example Fungi, Metazoa plus Choanoflagellata (Holozoa), Conosa, green plants, stramenopiles, and Apicomplexa) is recovered. Moreover, the monophyly of Opisthokonta (Fungi + Holozoa), Alveolata (Apicomplexa + ciliates), and Plantae (red algae + green plants) is found. However, the monophyly of Chromalveolata (alveolates and stramenopiles) (Cavalier-Smith, 2000; Fast et al., 2001) is not recovered. Within fungi, the grouping of ascomycetes and basidiomycetes, to the exclusion of chytridiomycetes and glomales, is supported by a bootstrap value (BV) of 100%. The early emergence of chytridiomycetes, until now only confirmed by a multigene phylogeny based on the mitochondrial genome (Bullerwell et al., 2003), is recovered, but not significantly supported. BVs are 86% and 66% for the separate and the concatenated analyses, respectively.


Figure 3 Apparently artefactual basal phylogenetic position of microsporidia inferred by the ML method based on 24,294 positions. The tree was inferred with a separate WAG+F+Γ8 model (using the exhaustive + constraints approach described in Materials and Methods). The phylogeny was also constructed with the concatenated JTT+F+Γ4 model using PhyML without any constraints. Bootstrap values are indicated only when below 100% (in bold for the separate WAG+F+Γ8 model and in italic for the concatenated JTT+F+Γ4 model). In the PhyML analyses a strong preference for the artefactual placement of the fast-evolving ciliate Tetrahymena in a basal position next to Microsporidia was supported by bootstrap values of 93%.

 
The microsporidium Encephalitozoon emerges at the base of eukaryotes with a high support (BV around 100%). An LBA artefact between the distantly related Archaea and the fast-evolving microsporidium likely explains this result. In fact, systematic biases constitute a serious issue when large data sets are used, even with an ML method and a reasonable species sampling (Philippe et al., 2005). However, the 133 genes of our data set do not all evolve at the same evolutionary rate in the microsporidial lineage. Therefore, in an attempt to overcome systematic biases, we assumed that the proteins that evolved the most slowly in microsporidia display a higher phylogenetic/nonphylogenetic signal ratio. We used the RFP method, which progressively eliminates the fastest evolving proteins for microsporidia, and studied the effect on phylogenetic inference (see Fig. 2 and Materials and Methods for a detailed description). Only proteins of the fast-evolving species were removed, in order to maintain a large data set, given the difficulty in resolving the eukaryotic phylogeny with significant support (Philippe et al., 2000a; see Appendix 5 for the list of genes eliminated, available at www.systematicbiology.org).

As shown in Figure 4A, the application of the RFP method has a profound impact on the phylogenetic position of microsporidia. The removal of 50% of the fastest microsporidial proteins leads to a slight decrease of the BV for the early emergence of this group (from 97% to 78%). The removal of more proteins decreases these BVs much more rapidly, converging to 0% for a removal of 80% and 90%. This decrease could be simply due to the fact that too many proteins are removed and no phylogenetic signal remains. However, BVs for the grouping of microsporidia with fungi show exactly the complementary trend, eventually converging to 100%. More precisely, the sum of the BVs for these two alternative positions of microsporidia (at the base of eukaryotes or with fungi) is always 100%. Therefore, our analysis strongly suggests that only two mutually exclusive signals exist for microsporidia: a nonphylogenetic signal due to LBA pulling them towards Archaea, and a genuine phylogenetic signal attracting them towards fungi. It should be noted that both signals are strong. For example, with only 10% of the microsporidial proteins remaining (3709 positions), the grouping with fungi is supported by a BV of 100%. Even with a probabilistic tree reconstruction method using a complex model and a reasonable taxonomic sampling, it is necessary to remove a large fraction of the proteins, corresponding to the noisiest data, in order to avoid the LBA artefact. Interestingly, this also allows recovery of the expected phylogeny.
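The removal step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the per-gene rate estimates are taken as a given input (how they are computed is described in Materials and Methods, not here), and the function names, the `?` missing-data symbol, and the data layout are hypothetical.

```python
from typing import Dict, Set


def genes_to_remove(rates: Dict[str, float], fraction: float) -> Set[str]:
    """Return the names of the fastest-evolving genes to discard.

    rates    -- hypothetical per-gene rate estimate for the fast lineage
    fraction -- share of genes to remove, between 0 and 1
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    # Rank genes from fastest to slowest; ties are broken by name.
    ranked = sorted(rates, key=lambda g: (-rates[g], g))
    # Number of genes to drop, rounded to the nearest integer.
    n_remove = round(fraction * len(ranked))
    return set(ranked[:n_remove])


def mask_fast_lineage(alignments, fast_species, removed):
    """Replace the fast species' sequence by missing data in removed genes.

    alignments -- {gene: {species: aligned sequence}} (assumed layout)
    Sequences of all other species are left untouched, so the data set
    stays large for the rest of the tree.
    """
    masked = {}
    for gene, seqs in alignments.items():
        seqs = dict(seqs)  # shallow copy so the input is not modified
        if gene in removed and fast_species in seqs:
            seqs[fast_species] = "?" * len(seqs[fast_species])
        masked[gene] = seqs
    return masked
```

Calling `genes_to_remove` with fractions 0.1, 0.2, up to 0.9 and re-running the tree inference on each masked data set would produce a curve of the kind plotted in Figure 4. Masking rather than deleting whole alignments reflects the point made in the text that only the fast-evolving species loses data.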


 
Figure 4 Relationship between bootstrap values and the percentage of removal of the fastest evolving proteins for microsporidium (A) and nucleomorph (B). The expected position (microsporidia with fungi and nucleomorph with red algae) is indicated with a closed triangle and the apparently artefactual one (the fast-evolving lineage at the base of eukaryotes) with a closed square. The trees were inferred with a separate WAG+F+{Gamma} 8 model, based on the same exhaustive + constraint search approach as in Figure 3.

 
We also applied the RFP method in the case of the nucleomorph (Fig. 4B). Exactly the same tendency is observed: the support for the apparently artefactual position (nucleomorph at the base of eukaryotes) decreases with sequence removal. Nevertheless, analysis of the complete dataset recovers the expected position of the nucleomorph (sister-group of red algae), but only with a BV of 58%. The support for this position rises to 95% at the removal of only 60% of the fast evolving nucleomorph proteins. The increase continues to a BV of 99% when additional proteins are removed. The difference between Figures 4A and 4B suggests that either the genuine phylogenetic signal is higher for nucleomorph than for microsporidia or the nonphylogenetic signal due to LBA is lower. Wiens (1998) shows that missing data may enhance the LBA artefact, because this mimics poor species sampling. However, our study shows that increasing the amount of missing data up to 90% allows the reduction of the LBA artefact, simply because the proteins that evolved the fastest in the lineage affected by the LBA have been removed. The relationships between LBA and missing data are thus complex and deserve further studies. Very recently, by using simulations, Wiens (2005) demonstrates the ability of incomplete taxa to reduce LBA when they break the long branches, in particular for model-based methods.

Relative Efficiency of Diverse Tree Reconstruction Methods
In order to evaluate the sensitivity of various tree reconstruction methods to the LBA artefact, we applied MP and ML methods to both the microsporidium (Fig. 5A) and the nucleomorph (Fig. 5B) data sets. In the case of the ML method, we compared the efficiency of models that deal with three kinds of heterogeneity in the evolutionary process: (1) the heterogeneity of amino acid replacement rates by comparing the Poisson, which assumes that all substitutions are equally likely, and the WAG replacement matrices (Whelan and Goldman, 2001); (2) the heterogeneity of replacement rates among positions (uniform or {Gamma} -distributed rates); (3) the heterogeneity of evolutionary rates between genes and species by comparing a concatenated model and a separate model that allows branch lengths and alpha parameter to vary from gene to gene (Yang, 1996). The evaluation of the relative efficiency is straightforward based on Figure 5: the better a given tree reconstruction method, the sooner (with a lower number of removed proteins) it will allow the recovery of a phylogeny not affected by the LBA artefact.
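The comparison criterion stated above (the better the method, the less data must be removed) can be expressed as a small Python sketch. The threshold of 95% and the three curves below are made-up values for illustration only; they are not read from Figure 5, and the function and variable names are hypothetical.

```python
from typing import Dict, Optional


def removal_needed(curve: Dict[int, float], threshold: float = 95.0) -> Optional[int]:
    """Smallest percentage of removed proteins at which the BV reaches threshold.

    curve maps the percentage removed to the bootstrap value (%) obtained
    for the expected position. Returns None if the threshold is never reached.
    """
    for pct in sorted(curve):
        if curve[pct] >= threshold:
            return pct
    return None


# Made-up curves for illustration only; they are NOT the values of Figure 5.
curves = {
    "model_A": {0: 10, 30: 40, 60: 96, 90: 99},
    "model_B": {0: 0, 30: 5, 60: 50, 90: 95},
    "model_C": {0: 0, 30: 0, 60: 0, 90: 6},
}


def sort_key(name):
    # Methods that never reach the threshold are ranked last.
    needed = removal_needed(curves[name])
    return (needed is None, needed if needed is not None else 0)


# Rank methods from most to least efficient against LBA.
ranking = sorted(curves, key=sort_key)
```

Taking the first crossing of the threshold is a design choice of this sketch. Real curves need not increase steadily, so a stricter variant could require the BV to stay above the threshold for all larger removal percentages.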


 
Figure 5 Efficiency of recovery of the position of microsporidia (A) and nucleomorph (B) with different tree reconstruction methods. Only the bootstrap values for the expected position of the fast-evolving species are indicated. All remaining 39 species were used and the protocol described in Figure 2 was applied with different models. For MP analyses, a heuristic search with TBR swapping and 10 random species additions was used. All ML-based methods used the exhaustive + constraint search approach. The evolutionary distances used for the rate-specific elimination of proteins (RFP method) were always computed based on the same model as used in the corresponding analyses, with the sole exception of the MP method, for which a WAG+F model was used.

 
The only nonprobabilistic method applied, the MP method, performed poorly in both cases, with a BV of 0% for the expected solution (Fig. 5) for all data sets up to 80% of protein removal. The BVs were different from 0% (up to 6% for microsporidia) only when 90% of the proteins were removed. The ML method with a simple and unrealistic model (separate Poisson+F without gamma) performs much better, recovering for example the monophyly of fungi + microsporidia with a BV of 94% when 90% of the fast proteins are removed. These results, obtained with real sequences, confirm previous results based on simulations (Anderson and Swofford, 2004; Huelsenbeck, 1998; Kuhner and Felsenstein, 1994; Qiu et al., 2001; Swofford et al., 2001). When some of the lineages evolve at markedly different rates, the use of probabilistic methods should be preferred over MP. A recent study (Kolaczkowski and Thornton, 2004) has demonstrated that MP outperforms ML when the level of heterotachy is extreme. However, this conclusion was based on simulation studies assuming a molecular clock and this does not hold when evolutionary rates vary considerably among lineages (unpublished results).

Considering the models of amino acid replacement, the Poisson model appears to be always less efficient than the WAG model (Fig. 5). For example, in the case of the nucleomorph with a {Gamma} distribution, it is necessary to remove 90% of the nucleomorph proteins to obtain a BV of 95% with a Poisson model, whereas the same BV is obtained through the removal of only 60% of the proteins with the WAG model (Fig. 5B). Taking the among site rate variation into account by the use of a {Gamma} distribution is also much more efficient against the LBA artefact both under Poisson and WAG matrices. These results demonstrate that ignoring the heterogeneity of the evolutionary process (for amino acid replacements and among positions) drastically reduces the accuracy of ML-based tree reconstruction methods.

Allowing for the possibility that different species evolve at different rates for different proteins produced less clear-cut results. For example, in the case of microsporidia, the concatenated WAG+F model is more sensitive to LBA than the separate WAG+F model, its performance being similar to that of the separate Poisson+F model (Fig. 5A). However, when a {Gamma} distribution is used, the concatenated and the separate models have similar efficiency. Indeed, in the case of the nucleomorph and a WAG+F+{Gamma} model, the concatenated analysis performs slightly better than the separate model, except when more than 80% of the proteins were removed.

Fit of the Model to the Data and Phylogenetic Accuracy
Because systematic errors occur when simplified models of sequence evolution used by the ML method are in conflict with the real evolutionary process, we evaluated how well the various models fit the data. We computed the AIC of each model for the nucleomorph data set (Table 1); the results are virtually identical for microsporidia (data not shown). As expected, the Poisson amino acid replacement matrix performs more poorly than JTT and WAG, whereas the WAG matrix has a slightly better fit to the data than JTT. The gamma distribution also improves greatly the fit of the model to the data (e.g., with separate model lnL = –744,406 WAG+F and lnL = –715,969 WAG+F+{Gamma}). Despite a serious increase in the number of parameters (12,804 additional parameters), the separate model has a better fit than the concatenated model (Table 1), according to the AIC. Therefore, taking into account the heterogeneity in the evolutionary process always improves the fit of the model to the data, albeit to noticeably different extents.
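The trade-off between fit and number of parameters can be made concrete with the standard definition AIC = 2k - 2 lnL, where k is the number of free parameters and lower values are better. The Python sketch below uses the two log-likelihoods quoted in the text; the count of 133 extra parameters (one alpha per gene for the 133 genes of the data set) is an assumption made here for illustration and is not stated for Table 1.

```python
def aic(lnL: float, n_params: int) -> float:
    """Akaike Information Criterion: AIC = 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * lnL


def delta_aic(lnL_simple, lnL_complex, extra_params):
    """AIC(complex) - AIC(simple); negative values favour the complex model."""
    return 2 * extra_params - 2 * (lnL_complex - lnL_simple)


# Log-likelihoods quoted in the text for the separate model.
lnL_wag_f = -744_406
lnL_wag_f_gamma = -715_969

# ASSUMPTION: adding the gamma distribution costs one alpha parameter per gene.
extra = 133

# 2 * 133 - 2 * 28437 = 266 - 56874
d = delta_aic(lnL_wag_f, lnL_wag_f_gamma, extra)
```

Under this assumption `d` is strongly negative, so the gamma distribution is favoured by a wide margin. The same formula gives the break-even point for the separate versus concatenated comparison: with 12,804 additional parameters, the separate model is preferred by AIC only if its log-likelihood is higher by more than 12,804 units.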



 
Table 1 Comparison of various models based on the Akaike Information Criterion (AIC). The separate model is always favored (lower AIC value) despite a serious increase in the number of free parameters.

 
The comparison of Table 1 and of Figure 5 confirms the hypothesis that using better models generally produces better phylogenies, in other words, that model misspecifications are the reason for the inconsistency of ML approaches. However, this relationship does not always hold (see Yang, 1997a), because the concatenated model sometimes performs better than the separate model, despite the fact that the separate model has a better fit. A possible explanation is that the estimation of branch lengths for each protein using a separate model is difficult, because only a limited number of positions are available. In contrast, this estimation is easier under the concatenated model. As a result, the microsporidial/nucleomorph branch is recognized as being very long, which allows the ML approach with a concatenated model to correct more efficiently for LBA artefacts.

Even the most complex models that we investigated (i.e., those readily available in current software packages) are sensitive to the LBA artefact; therefore the need for developing better tree reconstruction methods, in particular probabilistic ones with improved models of molecular evolution, is obvious. The protocol proposed here (Figs. 2 and 5) can be used as a way of assessment: a new method (model) will perform better if less data from fast-evolving species have to be removed in order to obtain the same BVs in favor of the grouping not affected by LBA. In particular, this benchmark could be used to test the efficiency of recently proposed methods with improved models, which deal with intrasite rate heterogeneity (i.e., heterotachy, Galtier, 2001; Huelsenbeck, 2002; Kolaczkowski and Thornton, 2004) and with the heterogeneity of the substitution process across sites (Lartillot and Philippe, 2004; Pagel and Meade, 2004).

Species Sampling and Sensitivity to LBA Artefacts
In phylogenomic studies, alignments often contain few taxa (Blair et al., 2002; Lerat et al., 2003; Philip et al., 2005; Rokas et al., 2003; Wolf et al., 2004). However, the accuracy of phylogenetic inference based on species-poor data sets is the subject of a long-standing controversy (Graur and Higgins, 1994; Hillis et al., 2003; Philippe and Douzery, 1994; Rosenberg and Kumar, 2003). To study the effect of species sampling, we progressively reduced the number of ingroup as well as outgroup species (see Appendix 3, available at www.systematicbiology.org, for the list of species used), while maintaining the number of positions (24,294) and the method (ML with a separate WAG+F+{Gamma} model) constant.

In the case of the nucleomorph, the sensitivity to LBA generally increases as the number of species decreases (Fig. 6B). However, the performance obtained with 15 species (six Archaea, the nucleomorph, and eight eukaryotic species representing the major lineages; open diamond) is virtually identical to that obtained with 40 species (open square), suggesting that the use of a single representative per major group is sufficient in this case. Nevertheless, the removal of a single additional eukaryotic species (14 species, open triangle) noticeably diminishes the efficiency. More significantly, when only a green plant and a red alga are used as representatives of the slowly evolving eukaryotes (closed triangle), BVs for the expected position of the nucleomorph were always below 64%. With only three archaeal outgroups (closed diamond), BVs were always below 30%, suggesting that the use of six outgroup species improves the inference.


 
Figure 6 Taxon sampling and the phylogenetic position of microsporidia (A) and nucleomorph (B). Only the bootstrap values (computed with a separate WAG+F+{Gamma} 8 model) for the expected position of the fast-evolving species are indicated. The highly reduced taxon sampling (six and nine species) corresponds to three eukaryotic ingroup species (microsporidia + Homo + Schizosaccharomyces or nucleomorph + Arabidopsis + Porphyra) and three or six archaeal species. For the sample of 14 and 15 species, six Archaea and the main eukaryotic lineages are present. For a detailed list of the species used, see Appendix 3 (available at www.systematicbiology.org).

 
The curves of the BVs for the grouping of nucleomorph with red algae are not perfectly monotonic increasing functions of the percentage of proteins removed (Figs. 5B and 6B). For example, there is a slight decrease of BV when the first 10% of proteins are removed. Two reasons probably explain the complexity of the curves. First, the RFP method is far from perfect: the fastest evolving proteins are not optimally detected by this method, because the power of the relative rate test is limited (Bromham et al., 2000; Philippe et al., 1994). Second, after the removal of 90% of the proteins, only 1885 amino acid positions remained for the nucleomorph. This low number of positions implies an increasing influence of the sampling error, rendering the curves irregular.

The results for microsporidia are similar (Fig. 6A). With six or nine species, even when 90% of the fast-evolving proteins are removed, the BVs for the grouping of Encephalitozoon with fungi remain below 10%. One of the most efficient tree reconstruction methods used in this study (a separate WAG+F+{Gamma} model) is unable to overcome the LBA artefact if only a few species are considered. Therefore taxa-poor phylogenomic studies should be regarded with great caution when species evolve at heterogeneous rates, in agreement with earlier studies (Adachi and Hasegawa, 1996a; Philippe and Douzery, 1994). For example, the paraphyly of Ecdysozoa observed in the analyses based on 100 genes/4 species (Blair et al., 2002), 500 genes/6 species (Wolf et al., 2004), and 780 genes/10 species (Philip et al., 2005) is most likely an artefact due to the high evolutionary rate of nematodes. This interpretation is in agreement with a study based on much wider taxon sampling, 146 genes/49 species (Philippe et al., 2005). It should be noted that the species sampling used in this study can be easily improved, in particular by including several microsporidia and nucleomorphs in order to break their long branches. We predict that the quantity of data that have to be removed in order to overcome LBA will diminish accordingly.

However, the effect of taxon sampling is not based solely on the number of species, but also depends on the identity of the species (Lecointre et al., 1993). For example, for the nucleomorph (Fig. 6B), the LBA artefact is less marked when 15 species (open diamond) are used instead of 14 species (open triangle), whereas the contrary is observed for microsporidia (Fig. 6A). The nature of the outgroup can also have a great influence. The LBA is more pronounced in the case of the nucleomorph, when only Pyrococcus (open circle) instead of all six archaeal species (open square) is used as outgroup; this sample with 35 species is even worse than the samples with 14 or 15 species (Fig. 6B). However, in the case of microsporidia (Fig. 6A), the results with one or six Archaea are quite similar, demonstrating that the effect of taxon sampling on phylogenetic inference can be tremendously difficult to predict.

The analyses in which the closest sister-group of the fast-evolving lineages is discarded, corresponding to red alga for nucleomorph and fungi for microsporidia (indicated by closed squares), are particularly interesting. In theory, the fast species should remain at the same position in the tree: they are expected to be a sister-group of green plants and of animals, respectively. Unfortunately, for the nucleomorph, even with the removal of 90% of the fast-evolving proteins, the BVs for the expected position remain below 5% (Fig. 6B). Contrary to all previous analyses, there are now more than two alternative positions for the nucleomorph, because the sum of the BVs for the expected and the basal positions is sometimes less than 100%. Nevertheless, the support for the nucleomorph as first emerging eukaryotes is always greater than 85%, indicating that it is not possible to overcome LBA. For microsporidia (Fig. 6A), the situation is less drastic since the expected position, as a sister-group of animals, is recovered with a BV of 78% if the sequence removal is maximal (90%). This difference between nucleomorph and microsporidia is at first sight surprising, because it represents the only case in which the inference is easier for microsporidia. This is likely due to the fact that the recovery of the monophyly of opisthokonts, in this case microsporidia and animals, is less difficult than that of Plantae, represented by the nucleomorph and green plants. Indeed, in another study using only slowly evolving species (Rodríguez-Ezpeleta et al., 2005), we have shown that it is necessary to use 5000 and 25,000 positions for obtaining a BV of 95% for the monophyly of opisthokonts and of Plantae, respectively.

An important conclusion can be drawn from the latter analyses: even when a large number of species and positions and an efficient tree reconstruction method are used, it turns out to be almost impossible to locate the fast-evolving lineages in the absence of closely related species in the data set. This probably explains why we were unable to place kinetoplastids (Leishmania major, Trypanosoma brucei, and T. cruzi) when we applied the RFP method. When the fast proteins are removed, the support for their early emergence decreases more quickly than in the case of the nucleomorph without red alga (from 100% with the complete alignment to 18% with 80% of kinetoplastid proteins removed). However, kinetoplastids do not cluster strongly with any group present in our dataset, the best BV being 34% for their grouping with Plantae (data not shown). Locating the fast-evolving eukaryotic groups such as kinetoplastids, diplomonads, or trichomonads with an archaeal outgroup will thus be a difficult and long-lasting task. The most straightforward approach would be to identify a slowly evolving and closely related group to these taxa. Thus, it is expected that several fast-evolving eukaryotic groups will artefactually remain at the base of the eukaryotic tree with a strong support, when numerous genes are used (Bapteste et al., 2002), until both improved species sampling and methodologies become available.

Phylogenetic Analyses in the Absence of the Distant Outgroup Archaea
To overcome the strong attraction between the distant archaeal outgroup and the fast-evolving ingroup, we have shown the need for good species sampling, an efficient tree reconstruction method and the removal of an important part of the fastest evolving proteins. As an alternative, the removal of the outgroup could allow the placement of problematic species, even if the question of the location of the root in the tree remains unsolved. The data sets without Archaea were analyzed separately with MrBayes and PHYML for both microsporidia and the nucleomorph. The results are strikingly different: the expected position of the fast-evolving species was recovered by MrBayes in both analyses with and without gamma-distributed rates, whereas either ciliates or alveolates and the fast-evolving species grouped together in the PHYML analyses. To verify that this difference is due to problems of the heuristic search (and not to a difference between ML and Bayesian approaches), various topologies were compared by LRT tests (Table 2). The expected position of both microsporidia and nucleomorph corresponds to the best ML tree and the LBA tree is always significantly rejected. The heuristic search of PHYML remains therefore trapped in a local minimum, illustrating the difficulty of heuristic searches when large data sets are considered. This argues in favor of our approach that combines topological constraints and an exhaustive search. However, when the closest sister-group of the fast-evolving species is eliminated (either the rhodophyte or fungi), the results of the analyses without outgroup are much less encouraging (Table 2). Nevertheless, our results confirmed the validity of the outgroup removal strategy for studying difficult phylogenetic questions.



 
Table 2 Comparison of the expected or LBA-related placement of the fast-evolving lineages nucleomorph and microsporidia (analyzed separately and without Archaea) according to AU (SH) test. Significant values in bold are below the 5% confidence level.

 
However, the removal of the outgroup is not necessarily the panacea: instead of being attracted by the outgroup, the fast-evolving lineage can be attracted by the longest ingroup branch (Philippe et al., 2005). To study this possibility, we have analyzed simultaneously nucleomorph and microsporidia (Table 3). Both fast-evolving species are at the expected position in the ML tree. However, the three alternative LBA artefact-based topologies are only significantly rejected with a {Gamma} model and when a closely related and slowly evolving sister group is present. We have also tested the heuristic search of MrBayes and of PHYML and confirmed that MrBayes always recovered the ML tree and PHYML the LBA tree. Finally, the MP analyses invariably group the two fast-evolving species together with a 100% bootstrap support. They formed a sister-group to ciliates, the fastest of the remaining eukaryotic species. The same highly supported sister-group relationship was also found by MP analyses including only one of the fast species. These analyses confirm the high sensitivity of the MP approach to LBA artefacts.



 
Table 3 Comparison of the expected and LBA related placements for both fast-evolving lineages nucleomorph and microsporidia without Archaea according to the AU (SH) test. Significant values in bold are below the 5% confidence level.

 
To gain insights regarding the position of the microsporidium Encephalitozoon within fungi, analyses in the absence of Archaea and more distantly related eukaryotes were carried out. Therefore, fungi and the microsporidium, together with animals, choanoflagellates, and the Conosa as outgroup sequences, were analyzed, using the RFP method with a separate WAG+F+{Gamma} model. When 80% of the fastest proteins are removed (Fig. 7), the microsporidium is no longer in a basal position with respect to the fungi, but emerges after the chytridiomycete Neocallimastix, although only weakly supported by a BV of 55%. This analysis suggests that microsporidia emerge within fungi, but our limited sample of chytridiomycetes and glomales and their incompleteness (8309 and 5490 amino acid positions, respectively) reduces the efficiency of our approach. The absence of Entomophthorales and Zoopagales, groups that have been proposed to be closely related to microsporidia (Keeling, 2003), is problematic, but EST sequencing of additional fungi (http://amoebidia.bcm.umontreal.ca/public/pepdb/agrm.php) will soon allow us to address this problem with an adequate species sampling.


 
Figure 7 Phylogenetic position of microsporidia inferred with a close outgroup. The tree was inferred with a separate WAG+F+{Gamma} 8 model when 80% of the fastest evolving microsporidial genes were removed. The nodes that were constrained in the analysis are indicated by an asterisk (*). All possible positions of Encephalitozoon were tested starting from the three possible alternative topologies.

 
Comparison of Simulated and Real Sequences
Our analyses demonstrate that the accuracy of current phylogenetic inference approaches is rather limited vis-à-vis LBA artefacts. However, simulation studies suggest that most methods are rather robust with respect to variable evolutionary rates among lineages (Guindon and Gascuel, 2003; Huelsenbeck, 1998; Kuhner and Felsenstein, 1994; Qiu et al., 2001; Swofford et al., 2001; Wolf et al., 2004). To gain further insights into this conundrum, we performed simulations to mimic the difficult case of microsporidia. Sequences were simulated with a complex model (separate JTT+F+{Gamma}) and trees were inferred by MP and by ML using various models. As shown in Table 4, even without any data removal, all methods, including MP, perform well, except when only three eukaryotic species are used (six and nine species). In these cases, ML requires the use of a {Gamma} model to recover the correct tree with high support. However, even an unrealistic model (Poisson+F instead of JTT+F+{Gamma}) recovers an important signal for the correct position of the fast-evolving species (BV close to 50%) when so few species are used. Table 4 also clearly illustrates that inconsistency of the ML approach is due to model misspecifications, because the correct tree is always recovered when the correct model is used. It should be remembered that, with real data, even with the most complex model and the removal of 90% of the noisiest proteins, the expected position of microsporidia was virtually unsupported when few species are used (BV below 10%, Fig. 6A).



 
Table 4 Bootstrap support values for the correct location of the fast-evolving species in the case of simulated data sets, species sampling as in Figure 6. The 32 species analyses correspond to the 40 sister-groups (microsporidia data set without the closely related fungi). The detailed species sampling for all seven data sets is given in Appendix 3 (available at www.systematicbiology.org).

 

    Conclusion
 
All our analyses demonstrate that tree reconstruction methods are robust to the LBA artefact only when using simulated data. This suggests that simulation studies should be used with great care to evaluate whether a result is due to an LBA artefact. More importantly, experiments based on simulations have led to overconfidence in the accuracy of tree reconstruction methods. We therefore believe that systematic errors, in particular those due to LBA, constitute a problem that should not be neglected in phylogenomic studies (Delsuc et al., 2005). To reduce their impact, we have shown that it is fundamental to (1) use probabilistic methods with complex models, (2) use a rich species sampling (including slowly evolving taxa closely related to the fast-evolving ones), and (3) remove a large proportion of the fast-evolving data.

In fact, a promising avenue in phylogenomics is to take advantage of the large number of positions available through the use of a subset of the data representing the most reliable characters, in order to obtain a phylogeny that minimizes systematic errors while remaining statistically significant. The fact that the RFP method is eliminating entire proteins from fast-evolving lineages (Fig. 2) does not mean that fast-evolving proteins are completely devoid of phylogenetic signal. A positional approach (Brinkmann and Philippe, 1999; Burleigh and Mathews, 2004; Pisani, 2004) could provide a better performance because it would more specifically remove the positions that mainly contain nonphylogenetic signal. We are currently evaluating the performance of these refined methods on the data sets used here.


    Acknowledgements
 
We wish to thank Frédéric Delsuc, Martin Embley, Nicolas Lartillot, Yu Liu, Nicolas Rodrigue, and Naiara Rodríguez-Ezpeleta for helpful comments on an earlier version of the manuscript. Furthermore we are grateful to associate editor Marshal Hedin and the two reviewers, John W. Stiller and Mark Fishbein, for suggestions helping to improve the manuscript. We thank Eric Bapteste for his help in aligning sequences of Neocallimastix and Glomus. HP was supported by the Canada Research Chair Program, the Université de Montréal, and a Bioinformatics Grant of Génome Québec.


    References
 

    Adachi J., Hasegawa M. Instability of quartet analyses of molecular sequence data by the maximum likelihood method: The Cetacea/Artiodactyla relationships. Mol. Phylogenet. Evol. (1996a) 6:72–76.

    Adachi J., Hasegawa M. MOLPHY version 2.3: Programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. (1996b) 28:1–150.

    Aguinaldo A. M., Turbeville J. M., Linford L. S., Rivera M. C., Garey J. R., Raff R. A., Lake J. A. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature (1997) 387:489–493.

    Akaike H. Information theory and an extension of the maximum likelihood principle. In: Proceedings 2nd International Symposium on Information Theory—Petrov B. N., Csaki F., eds. (1973) Budapest: Akademia Kiado. Pages 267–281.

    Anderson F. E., Swofford D. L. Should we be worried about long-branch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol. Phylogenet. Evol. (2004) 33:440–451.

    Baldauf S. L., Roger A. J., Wenk-Siefert I., Doolittle W. F. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science (2000) 290:972–977.

    Bapteste E., Brinkmann H., Lee J. A., Moore D. V., Sensen C. W., Gordon P., Durufle L., Gaasterland T., Lopez P., Muller M., Philippe H. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA (2002) 99:1414–1419.

    Blair J. E., Ikeo K., Gojobori T., Hedges S. B. The evolutionary position of nematodes? BMC Evol. Biol. (2002) 2:7.

    Brinkmann H., Philippe H. Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol. Biol. Evol. (1999) 16:817–825.

    Brochier C., Philippe H. Phylogeny: A non-hyperthermophilic ancestor for bacteria? Nature (2002) 417:244.

    Bromham L., Penny D., Rambaut A., Hendy M. D. The power of relative rates tests depends on the data. J. Mol. Evol. (2000) 50:296–301.

    Bullerwell C. E., Forget L., Lang B. F. Evolution of monoblepharidalean fungi based on complete mitochondrial genome sequences. Nucleic Acids Res. (2003) 31:1614–1623.

    Burleigh J. G., Mathews S. Phylogenetic signal in nucleotide data from seed plants: Implications for resolving the seed plant tree of life. Am. J. Bot. (2004) 91:1599–1613.

    Burnham K. P., Anderson D. R. Model selection and multimodel inference: A practical information-theoretic approach, 2nd ed. (2003) New York: Springer-Verlag.

    Busse I., Preisfeld A. Systematics of primary osmotrophic euglenids: A molecular approach to the phylogeny of Distigma and Astasia (Euglenozoa). Int. J. Syst. Evol. Microbiol. (2003) 53:617–624.

    Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. (2000) 17:540–552.

    Cavalier-Smith T. Membrane heredity and early chloroplast evolution. Trends Plant Sci. (2000) 5:174–182.

    Dacks J. B., Marinets A., Doolittle W., Cavalier-Smith T., Logsdon J. M. Jr. Analyses of RNA Polymerase II genes from free-living protists: Phylogeny, long branch attraction, and the eukaryotic big bang. Mol. Biol. Evol. (2002) 19:830–840.

    Dacks J. B., Silberman J. D., Simpson A. G., Moriya S., Kudo T., Ohkuma M., Redfield R. J. Oxymonads are closely related to the excavate taxon Trimastix. Mol. Biol. Evol. (2001) 18:1034–1044.

    Delsuc F., Brinkmann H., Philippe H. Phylogenomics and the reconstruction of the Tree of Life: Methods, advances, and challenges. Nat. Rev. Genet. (2005) 6:361–375.

    Douglas S., Zauner S., Fraunholz M., Beaton M., Penny S., Deng L. T., Wu X., Reith M., Cavalier-Smith T., Maier U. G. The highly reduced genome of an enslaved algal nucleus. Nature (2001) 410:1091–1096.

    Douzery E. J., Snell E. A., Bapteste E., Delsuc F., Philippe H. The timing of eukaryotic evolution: Does a relaxed molecular clock reconcile proteins and fossils? Proc. Natl. Acad. Sci. USA (2004) 101:15386–15391.

    Embley T. M., van der Giezen M., Horner D. S., Dyal P. L., Bell S., Foster P. G. Hydrogenosomes, mitochondria and early eukaryotic evolution. IUBMB Life (2003) 55:387–395.

    Fast N. M., Kissinger J. C., Roos D. S., Keeling P. J. Nuclear-encoded, plastid-targeted genes suggest a single common origin for apicomplexan and dinoflagellate plastids. Mol. Biol. Evol. (2001) 18:418–426.

    Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. (1978) 27:401–410.

    Felsenstein J. Inferring phylogenies. (2004) Sunderland, Massachusetts: Sinauer Associates.

    Forterre P., Philippe H. Where is the root of the universal tree of life? BioEssays (1999) 21:871–879.

    Foster P. G., Hickey D. A. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. (1999) 48:284–290.

    Galtier N. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. (2001) 18:866–873.

    Gascuel O. BIONJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. (1997) 14:685–695.

    Germot A., Philippe H., Le Guyader H. Evidence for loss of mitochondria in Microsporidia from a mitochondrial-type HSP70 in Nosema locustae. Mol. Biochem. Parasitol. (1997) 87:159–168.

    Gibbs S. P. The chloroplasts of some algal groups may have evolved from endosymbiotic eukaryotic algae. Ann. NY Acad. Sci. (1981) 361:193–208.

    Grassly N. C., Adachi J., Rambaut A. PSeq-Gen: An application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Comput. Appl. Biosci. (1997) 13:559–560.

    Graur D., Higgins D. G. Molecular evidence for the inclusion of cetaceans within the order Artiodactyla. Mol. Biol. Evol. (1994) 11:357–364.

    Guindon S., Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. (2003) 52:696–704.

    Hampl V., Cepicka I., Flegr J., Tachezy J., Kulda J. Critical analysis of the topology and rooting of the parabasalian 16S rRNA tree. Mol. Phylogenet. Evol. (2004) 32:711–723.

    Hasegawa M., Fujiwara M. Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor-joining methods for estimating protein phylogeny. Mol. Phylogenet. Evol. (1993) 2:1–5.

    Hendy M., Penny D. A framework for the quantitative study of evolutionary trees. Syst. Zool. (1989) 38:297–309.

    Hillis D. M., Pollock D. D., McGuire J. A., Zwickl D. J. Is sparse taxon sampling a problem for phylogenetic inference? Syst. Biol. (2003) 52:124–126.

    Huelsenbeck J. P. Is the Felsenstein zone a fly trap? Syst. Biol. (1997) 46:69–74.

    Huelsenbeck J. P. Systematic bias in phylogenetic analysis: Is the Strepsiptera problem solved? Syst. Biol. (1998) 47:519–537.

    Huelsenbeck J. P. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. (2002) 19:698–707.

    Inagaki Y., Susko E., Fast N. M., Roger A. J. Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1α phylogenies. Mol. Biol. Evol. (2004) 21:1340–1349.

    James T. Y., Porter D., Leander C. A., Vilgalys R., Longcore J. E. Molecular phylogenetics of the Chytridiomycota support the utility of ultrastructural data in chytrid systematics. Can. J. Bot. (2000) 78:336–350.

    Jones D. T., Taylor W. R., Thornton J. M. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. (1992) 8:275–282.

    Katinka M. D., Duprat S., Cornillot E., Metenier G., Thomarat F., Prensier G., Barbe V., Peyretaillade E., Brottier P., Wincker P., Delbac F., El Alaoui H., Peyret P., Saurin W., Gouy M., Weissenbach J., Vivares C. P. Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature (2001) 414:450–453.

    Keeling P. J. Congruent evidence from alpha-tubulin and beta-tubulin gene phylogenies for a zygomycete origin of microsporidia. Fungal Genet. Biol. (2003) 38:298–309.

    Keeling P. J., Fast N. M. Microsporidia: Biology and evolution of highly reduced intracellular parasites. Annu. Rev. Microbiol. (2002) 56:93–116.

    Kim J. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. (1996) 45:363–374.

    King N., Hittinger C. T., Carroll S. B. Evolution of key cell signaling and adhesion protein families predates animal origins. Science (2003) 301:361–363.

    Kishino H., Miyata T., Hasegawa M. Maximum likelihood inference of protein phylogeny, and the origin of chloroplasts. J. Mol. Evol. (1990) 31:151–160.

    Kolaczkowski B., Thornton J. W. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature (2004) 431:980–984.

    Kuhner M. K., Felsenstein J. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. (1994) 11:459–468.

    Lake J. A., Rivera M. C. Was the nucleus the first endosymbiont? Proc. Natl. Acad. Sci. USA (1994) 91:2880–2881.

    Lang B. F., O'Kelly C., Nerad T., Gray M. W., Burger G. The closest unicellular relatives of animals. Curr. Biol. (2002) 12:1773–1778.

    Lartillot N., Philippe H. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. (2004) 21:1095–1109.

    Lecointre G., Philippe H., Le H. L. V., Le Guyader H. Species sampling has a major impact on phylogenetic inference. Mol. Phylogenet. Evol. (1993) 2:205–224.

    Lerat E., Daubin V., Moran N. A. From gene trees to organismal phylogeny in prokaryotes: The case of the gamma-Proteobacteria. PLoS Biol. (2003) 1:E19.

    Lockhart P. J., Larkum A. W., Steel M., Waddell P. J., Penny D. Evolution of chlorophyll and bacteriochlorophyll: The problem of invariant sites in sequence analysis. Proc. Natl. Acad. Sci. USA (1996) 93:1930–1934.

    Lopez-Garcia P., Moreira D. Metabolic symbiosis at the origin of eukaryotes. Trends Biochem. Sci. (1999) 24:88–93.

    Madsen O., Scally M., Douady C. J., Kao D. J., DeBry R. W., Adkins R., Amrine H. M., Stanhope M. J., de Jong W. W., Springer M. S. Parallel adaptive radiations in two major clades of placental mammals. Nature (2001) 409:610–614.

    Martin W., Müller M. The hydrogen hypothesis for the first eukaryote. Nature (1998) 392:37–41.

    Murphy W. J., Eizirik E., O'Brien S. J., Madsen O., Scally M., Douady C., Teeling E., Ryder O. A., Stanhope M. J., de Jong W. W., Springer M. S. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science (2001) 294:2348–2351.

    Nozaki H., Matsuzaki M., Takahara M., Misumi O., Kuroiwa H., Hasegawa M., Shin I. T., Kohara Y., Ogasawara N., Kuroiwa T. The phylogenetic position of red algae revealed by multiple nuclear genes from mitochondria-containing eukaryotes and an alternative hypothesis on the origin of plastids. J. Mol. Evol. (2003) 56:485–497.

    Pagel M., Meade A. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. (2004) 53:571–581.

    Philip G. K., Creevey C. J., McInerney J. O. The Opisthokonta and the Ecdysozoa may not be clades: Stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol. Biol. Evol. (2005) 22:1175–1184.

    Philippe H. Rodent monophyly: Pitfalls of molecular phylogenies. J. Mol. Evol. (1997) 45:712–715.

    Philippe H., Douzery E. The pitfalls of molecular phylogeny based on four species, as illustrated by the Cetacea/Artiodactyla relationships. J. Mamm. Evol. (1994) 2:133–152.

    Philippe H., Germot A. Phylogeny of eukaryotes based on ribosomal RNA: Long-branch attraction and models of sequence evolution. Mol. Biol. Evol. (2000) 17:830–834.

    Philippe H., Germot A., Moreira D. The new phylogeny of eukaryotes. Curr. Opin. Genet. Dev. (2000a) 10:596–601.

    Philippe H., Lartillot N., Brinkmann H. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa and Protostomia. Mol. Biol. Evol. (2005) 22:1246–1253.

    Philippe H., Laurent J. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. (1998) 8:616–623.

    Philippe H., Lopez P., Brinkmann H., Budin K., Germot A., Laurent J., Moreira D., Müller M., Le Guyader H. Early branching or fast evolving eukaryotes? An answer based on slowly evolving positions. Proc. R. Soc. Lond. B Biol. Sci. (2000b) 267:1213–1221.

    Philippe H., Snell E. A., Bapteste E., Lopez P., Holland P. W., Casane D. Phylogenomics of eukaryotes: Impact of missing data on large alignments. Mol. Biol. Evol. (2004) 21:1740–1752.

    Philippe H., Sörhannus U., Baroin A., Perasso R., Gasse F., Adoutte A. Comparison of molecular and paleontological data in diatoms suggests a major gap in the fossil record. J. Evol. Biol. (1994) 7:247–265.

    Phillips M. J., Delsuc F., Penny D. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. (2004) 21:1455–1458.

    Pisani D. Identifying and removing fast-evolving sites using compatibility analysis: An example from the arthropoda. Syst. Biol. (2004) 53:978–989.

    Poe S. Evaluation of the strategy of long-branch subdivision to improve the accuracy of phylogenetic methods. Syst. Biol. (2003) 52:423–428.

    Poole A., Jeffares D., Penny D. Early evolution: Prokaryotes, the new kids on the block. BioEssays (1999) 21:880–889.

    Qiu Y. L., Lee J., Bernasconi-Quadroni F., Soltis D. E., Soltis P. S., Zanis M., Zimmer E. A., Chen Z., Savolainen V., Chase M. W. The earliest angiosperms: Evidence from mitochondrial, plastid and nuclear genomes. Nature (1999) 402:404–407.

    Qiu Y. L., Lee J., Whitlock B. A., Bernasconi-Quadroni F., Dombrovska O. Was the ANITA rooting of the angiosperm phylogeny affected by long-branch attraction? Amborella, Nymphaeales, Illiciales, Trimeniaceae, and Austrobaileya. Mol. Biol. Evol. (2001) 18:1745–1753.

    Rodríguez-Ezpeleta N., Brinkmann H., Burey S. C., Roure B., Burger G., Löffelhardt W., Bohnert H. J., Philippe H., Lang B. F. Monophyly of primary photosynthetic eukaryotes: Green plants, red algae and glaucophytes. Curr. Biol. (2005) 15:1325–1330.

    Rokas A., Williams B. L., King N., Carroll S. B. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature (2003) 425:798–804.

    Ronquist F., Huelsenbeck J. P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.

    Rosenberg M. S., Kumar S. Taxon sampling, bioinformatics, and phylogenomics. Syst. Biol. (2003) 52:119–124.

    Salter L. A. Complexity of the likelihood surface for a large DNA dataset. Syst. Biol. (2001) 50:970–978.

    Sanderson M. J., Wojciechowski M. F., Hu J., Khan T. S., Brady S. G. Error, bias, and long-branch attraction in data for two chloroplast photosystem genes in seed plants. Mol. Biol. Evol. (2000) 17:782–797.

    Schmidt H. A., Strimmer K., Vingron M., von Haeseler A. TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics (2002) 18:502–504.

    Shimodaira H., Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. (1999) 16:1114–1116.

    Shimodaira H., Hasegawa M. CONSEL: For assessing the confidence of phylogenetic tree selection. Bioinformatics (2001) 17:1246–1247.

    Simpson A. G., Roger A. J., Silberman J. D., Leipe D. D., Edgcomb V. P., Jermiin L. S., Patterson D. J., Sogin M. L. Evolutionary history of "early-diverging" eukaryotes: The excavate taxon Carpediemonas is a close relative of Giardia. Mol. Biol. Evol. (2002) 19:1782–1791.

    Soltis P. S., Soltis D. E., Chase M. W. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature (1999) 402:402–404.

    Stechmann A., Cavalier-Smith T. Rooting the eukaryote tree by using a derived gene fusion. Science (2002) 297:89–91.

    Stiller J., Hall B. Long-branch attraction and the rDNA model of early eukaryotic evolution. Mol. Biol. Evol. (1999) 16:1270–1279.

    Sullivan J., Swofford D. L. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. (2001) 50:723–729.

    Swofford D. L. PAUP*: Phylogenetic analysis using parsimony and other methods, version 4b10. (2000) Sunderland, Massachusetts: Sinauer Associates.

    Swofford D. L., Waddell P. J., Huelsenbeck J. P., Foster P. G., Lewis P. O., Rogers J. S. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. (2001) 50:525–539.

    Teunissen M. J., Op den Camp H. J. Anaerobic fungi and their cellulolytic and xylanolytic enzymes. Antonie Van Leeuwenhoek (1993) 63:63–76.

    Whelan S., Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. (2001) 18:691–699.

    Wiens J. J. Does adding characters with missing data increase or decrease phylogenetic accuracy? Syst. Biol. (1998) 47:625–640.

    Wiens J. J. Missing data, incomplete taxa, and phylogenetic accuracy. Syst. Biol. (2003) 52:528–538.

    Wiens J. J. Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Syst. Biol. (2005) 54:731–742.

    Wolf Y. I., Rogozin I. B., Koonin E. V. Coelomata and not Ecdysozoa: Evidence from genome-wide phylogenetic analysis. Genome Res. (2004) 14:29–36.

    Xue G. P., Orpin C. G., Gobius K. S., Aylward J. H., Simpson G. D. Cloning and expression of multiple cellulase cDNAs from the anaerobic rumen fungus Neocallimastix patriciarum in Escherichia coli. J. Gen. Microbiol. (1992) 138:1413–1420.

    Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. (1993) 10:1396–1401.

    Yang Z. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. (1996) 42:587–596.

    Yang Z. How often do wrong models produce better phylogenies? Mol. Biol. Evol. (1997a) 14:105–108.

    Yang Z. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. (1997b) 13:555–556.

    Yoon H. S., Hackett J. D., Pinto G., Bhattacharya D. The single, ancient origin of chromist plastids. Proc. Natl. Acad. Sci. USA (2002) 99:15507–15512.

