Systematic Biology 2005 54(2):183-196; doi:10.1080/10635150590923254

© 2005 Society of Systematic Biologists

Towards Building the Tree of Life: A Simulation Study for All Angiosperm Genera

Edited by Mark Holder: Associate Editor Chris Simon Editor

Nicolas Salamin1,2, Trevor R. Hodkinson1 and Vincent Savolainen2

1 Department of Botany, University of Dublin, Trinity College Dublin 2, Ireland E-mail: nicolas.salamin{at}unil.ch
2 Molecular Systematics Section, Jodrell Laboratory Royal Botanic Gardens, Kew, Richmond Surrey, TW9 3DS, London, UK


    Abstract
Comprehensive phylogenetic trees are essential tools to better understand evolutionary processes. For many groups of organisms or projects aiming to build the Tree of Life, comprehensive phylogenetic analysis implies sampling hundreds to thousands of taxa. For the tree of all life this task rises to a highly conservative 13 million. Here, we assessed the performances of methods to reconstruct large trees using Monte Carlo simulations with parameters inferred from four large angiosperm DNA matrices, containing between 141 and 567 taxa. For each data set, parameters of the HKY85+{Gamma} model were estimated and used to simulate 20 new matrices for sequence lengths from 100 to 10,000 base pairs. Maximum parsimony and neighbor joining were used to analyze each simulated matrix. In our simulations, accuracy was measured by counting the number of nodes in the model tree that were correctly inferred. The accuracy of the two methods increased very quickly with the addition of characters before reaching a plateau around 1000 nucleotides for any sizes of trees simulated. An increase in the number of taxa from 141 to 567 did not significantly decrease the accuracy of the methods used, despite the increase in the complexity of tree space. Moreover, the distribution of branch lengths rather than the rate of evolution was found to be the most important factor for accurately inferring these large trees. Finally, a tree containing 13,000 taxa was created to represent a hypothetical tree of all angiosperm genera and the efficiency of phylogenetic reconstructions was tested with simulated matrices containing an increasing number of nucleotides up to a maximum of 30,000. Even with such a large tree, our simulations suggested that simple heuristic searches were able to infer up to 80% of the nodes correctly.

Keywords: Angiosperms; maximum parsimony; Monte Carlo simulations; neighbor joining; taxon sampling; Tree of Life

Received October 21, 2003; Revised February 8, 2004; Accepted November 1, 2004


A major challenge for systematists over the coming decades is to assemble a Tree of Life (ToL; Baldauf, 2003; Eisen and Fraser, 2003; Mace et al., 2003), and both U.S. and EU funding agencies are supporting workshops and projects to make the reconstruction of such a universal tree a major research focus. Regardless of the scale of the phylogenetic ToL problem (Stork, 1997, gave a conservative estimate of 13 million for the total number of species in the world), sampling large numbers of taxa is a requirement for obtaining a better understanding of macroevolutionary processes and for resolving broad systematic issues. With the advance of molecular techniques, immense numbers of DNA sequences are being produced (over 38 billion nucleotides were accessible in GenBank/EMBL/DDBJ in April 2004), and this resource has not been fully used in a phylogenetic context. Mining such a large amount of data in an efficient and accurate way is difficult due mainly to the nonconcerted sequencing effort. Algorithms designed to optimize the number of DNA regions and taxa that can be combined are becoming available, but the focus has been on obtaining numerous combinable DNA regions for relatively few taxa (Sanderson and Driskell, 2003). On the other hand, large and comprehensive phylogenetic trees based on a few DNA regions are also being analyzed (Miadlikowska and Lutzoni, 2000; Omilian and Taylor, 2001; Miya and Nishida, 2002; Adkins et al., 2003; Telford et al., 2003). For angiosperms in particular, complete family-level phylogenetic trees have already been built (e.g., Qiu et al., 2000; Savolainen et al., 2000a, 2000b; Soltis et al., 2000; Zanis et al., 2003). The next logical goal is to obtain complete genus-level estimates for all flowering plants, a task that will involve approximately 13,000 taxa (Mabberley, 1993).
It is therefore important to understand whether existing methods of phylogenetic reconstruction are capable of accommodating large numbers of terminal taxa and to what extent new algorithms must be developed first.

Phylogenetic reconstructions belong to a set of computational decision problems that cannot be solved in polynomial time and for which no efficient algorithms for their solution are known to exist. The only practical solution is to rely on heuristic search strategies, with the risk of missing the optimal topology (Swofford et al., 1996). A search through the tree space to find the global optimum can be performed using branch-and-bound algorithms, for example, when the number of terminal taxa is low. But sampling a small fraction of the taxa in a large group of organisms can lead to long branch attraction and have serious effects on parsimony tree reconstruction (Felsenstein, 1978; Hendy and Penny, 1989; Kim, 1996; Steel, 2001; Huelsenbeck and Lander, 2003) and in model-based methods if the model of evolution is seriously misestimated (Huelsenbeck, 1995; Gaut and Lewis, 1995; Sullivan and Swofford, 1997). This problem can also affect assessments of the reliability of the results based upon resampling techniques such as bootstrap or jackknife. However, in the case of resampling, very simple heuristic strategies have been found to give a good approximation (Salamin et al., 2003), allowing large taxa sets to be analyzed. Other approaches such as supertree reconstructions have been proposed for building large phylogenetic trees, but similar computational problems affect most of these methods, with the exception of the MinCutSupertree algorithm (Semple and Steel, 2000; Bininda-Emonds et al., 2002; Salamin et al., 2002).
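
To give a sense of the size of the tree space that such heuristic searches must sample, recall that the number of distinct unrooted, fully bifurcating topologies for n labelled taxa is (2n - 5)!!. The short Python sketch below is an illustration only; the function name is hypothetical and it is not part of any software used in this study.

```python
import math


def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating topologies for n_taxa labelled tips.

    Uses the standard result (2n - 5)!! = 3 * 5 * ... * (2n - 5) for n >= 3.
    """
    if n_taxa < 3:
        return 1  # one or two tips admit a single (trivial) topology
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


if __name__ == "__main__":
    # Report orders of magnitude for tree sizes discussed in the text.
    for n in (10, 141, 567, 13000):
        exponent = int(math.log10(num_unrooted_trees(n)))
        print(f"{n} taxa: roughly 10^{exponent} topologies")
```

The count grows faster than exponentially with the number of taxa, which is why exact searches such as branch-and-bound are restricted to small data sets.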

Building phylogenetic trees containing several hundreds of taxa could require a vast number of characters. Indeed, a previous study (Hillis et al., 1994) has demonstrated that very large molecular data sets are needed for accurate phylogenetic estimation. Under extreme evolutionary rate variation, correct recovery of the phylogeny of just four taxa requires both an accurate DNA substitution model and information from tens of thousands to millions of nucleotides (Hillis et al., 1994). Using less extreme evolutionary rates, Bininda-Emonds et al. (2000) showed that the increase in number of taxa required only an arithmetic increase in the number of characters to maintain the same level of accuracy. For example, even with 8192 taxa, no more than 1000 characters were needed to achieve 70% accuracy in their simulations. For more stringent levels of accuracy, however, the scaling of the number of characters was worse than the logarithm of the number of taxa.

Although dealing with DNA matrices containing large numbers of taxa could seem, a priori, an impossible task, Hillis (1996) reached an opposite conclusion. Using Monte Carlo simulations based on the 18S small subunit of nuclear ribosomal DNA (18S rDNA) for 228 angiosperm taxa (see Soltis et al., 1997, for details on the tree itself) and a simple model of DNA evolution, he showed that maximum parsimony (MP) and neighbor joining (NJ) could easily retrieve the model tree. Only one replicate was performed for both methods and only one model tree was used, but the results of Hillis (1996) suggested that, at least under some conditions, current heuristic strategies were able to infer large phylogenetic trees. Furthermore, Graybeal (1998) showed that accuracy of phylogenetic reconstruction was improved with the addition of taxa, whereas the improvement in accuracy was much less perceptible when the numbers of characters were increased. However, the way the taxa are added to a growing tree has also been shown to have a large impact on the accuracy (Kim, 1998). Furthermore, Rannala et al. (1998) concluded that, for a given number of taxa, the accuracy of the inferred phylogeny is increased if the terminal taxa represent a more complete sample of the extant taxa. Simply introducing more taxa will not increase the accuracy of the inferred phylogeny if they are poorly sampled or if these additional taxa share a more distant ancestor than the ingroup. With complete samples of taxa, Bininda-Emonds et al. (2000) further showed that shallow nodes were far easier to reconstruct than deeper nodes. In their simulations, the size of the tree did not affect the accuracy of inferring shallow nodes, but had a large influence on deeper nodes. For example, virtually none of the deeper nodes were inferred correctly for a 1024-taxa tree (Bininda-Emonds et al., 2000).

Purvis and Quicke (1997) proposed that the 18S rDNA angiosperm tree used by Hillis (1996) could be considered as a "perfect" tree, which luckily turned out to be well suited for MP analysis (for instance, with small mean number of substitutions per site in the tree). However, MP easily retrieved the same 18S rDNA tree even with, or perhaps because of, a 10-fold increase in the expected number of substitutions (Purvis and Quicke, 1997; Hillis, 1998), suggesting that MP can accommodate higher levels of substitutions than usually expected (Yang, 1998). This led Purvis and Quicke (1997) to argue, based on Hillis's (1996) results, for a diminished role of statistical models in large phylogenetic reconstruction. Yang and Goldman (1997) criticized this point and advocated instead the importance of statistical models in phylogenetic reconstruction for parameter estimation and hypothesis testing. They further emphasized that the argument regarding the difficulty in building large phylogenetic trees is only valid if we measure accuracy by the frequency with which the entire tree is correctly recovered (Yang and Goldman, 1997), a measure that would have given both MP and NJ accuracies near zero in simulations done by Hillis (1996). If the percentage of tree correct is used instead, large trees are not expected to be harder than smaller ones from a theoretical point of view (Yang and Goldman, 1997).

The aim of this study was to assess whether reconstructions such as Hillis's (1996) 18S rDNA tree are exceptions or whether large phylogenetic trees can often be accurately reconstructed using MP and NJ methods. In other words: "are the big [indeed] easy?" (Purvis and Quicke, 1997; Yang and Goldman, 1997). Angiosperms are an ideal test group to investigate this problem, because a number of large DNA matrices containing several hundreds of taxa are available. The same set of taxa forms the majority of each of these matrices, with the largest matrix containing several hundred terminal taxa representing all angiosperm families. In particular, the two plastid coding genes rbcL and atpB have been sequenced for a large number of taxa so that direct comparisons of the performances reached with these two genes and the 18S rDNA study from Hillis (1996, 1998) can be made. First, we investigated MP and NJ performance using several angiosperm trees containing an increasing number of taxa (141, 228, 357, and 567) to be compared with Hillis's results (1996, 1998). Second, we used different topologies as models to compare the effect of different tree sizes and branch length distributions on the accuracy of the reconstructions. Third, we considered the effect of different models of DNA evolution on those MP and NJ reconstructions by using the two plastid genes. Finally, the feasibility of accurately reconstructing complete generic-level phylogenetic trees for the angiosperms was evaluated by comparing the performances of MP and NJ for building trees containing as many as 13,000 taxa with an increasing number of characters.


    MATERIALS AND METHODS
Data Matrices and Phylogenetic Trees
The data matrix used by Hillis (1996) containing 228 angiosperm taxa for 18S rDNA from Soltis et al. (1997) was reanalyzed to obtain maximum likelihood branch lengths. Two other large 18S rDNA matrices, representing 141 (Chase and Cox, 1998) and 567 (Soltis et al., 2000) taxa, were also used, while a subset of 320 18S rDNA sequences was taken from the 567-taxa matrix of Soltis et al. (2000) to create an additional matrix of intermediate size. Apart from 18S rDNA, the 141- and 567-taxa matrices contain the two plastid gene sequences of atpB and rbcL that were used in this study. Last, a matrix containing 357 taxa (Savolainen et al., 2000a) for atpB and rbcL was also selected. The 320-taxa subset for 18S rDNA described above represents the taxa that are identical between the 357-taxa and 567-taxa matrices.

The three published matrices with 141, 357, and 567 taxa have been previously analyzed (Chase and Cox, 1998; Savolainen et al., 2000a; Soltis et al., 2000; respectively), and the most parsimonious trees published in these studies based on combined analyses were kept as the reference trees to be used in subsequent simulations. For the matrix with 320 taxa containing 18S rDNA, the 37 taxa that had no corresponding entry in the 567-taxa matrix were pruned from the most parsimonious tree published in Savolainen et al. (2000a), which was based on atpB and rbcL alone. We referred to those MP trees as the combined model trees. In order to make valid comparisons with Hillis (1996), MP analyses were performed on 18S rDNA alone for each matrix. Indeed, trees for 18S rDNA alone were not available from the publications cited above. Heuristic searches with 100 replicates of random addition sequence were performed, retaining up to 10 trees at each replicate, followed by nearest-neighbor interchange (NNI) swapping using PAUP*4b10 (Swofford, 2000). One randomly chosen most parsimonious tree was kept as the model tree for further analyses. In total, four trees with 141, 228, 320, and 567 terminal taxa were obtained from 18S rDNA alone (hereafter 18S model trees).

Simulations of atpB, rbcL, and 18S rDNA
Branch lengths and model parameters (HKY85+{Gamma}; Hasegawa et al., 1985; Yang, 1994) were estimated by maximum likelihood (ML) for all partitions of the four DNA matrices (i.e., 18S rDNA for the 18S and combined model trees, and atpB, rbcL, and their respective codon positions separately for the combined model trees) and a Gamma distribution was fitted on the branch length distributions using the software R 1.6.0 (http://www.r-project.org). The shape of the distributions was estimated by grouping the branch lengths into categories of 0.001 expected substitutions per site. To assess whether short and long branches were intermixed along the tree (thus creating the possibility of long branch attraction), the length ratio r of each internal branch to its longest daughter branch in the rooted model tree was calculated. The length ratios r were grouped as percentages into five categories: r < 0.25, 0.25 ≤ r < 0.5, 0.5 ≤ r < 0.75, 0.75 ≤ r < 1, r ≥ 1. We also computed a statistic for the observed number of substitutions per site by averaging the paths from the root to all tips of the tree on each model tree and dividing the total by the length of the actual sequence from each partition of the DNA matrices, providing an additional comparison of rates of change between partitions.
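
The two branch length summaries described above (the ratio r of each internal branch to its longest daughter branch, grouped into five categories, and the mean root-to-tip path length) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Node` class, the function names, and the toy tree are hypothetical, and the code does not reproduce the R scripts actually used. It returns the mean path length in the units of the branch lengths, without the additional division by sequence length mentioned in the text.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Dict, List

# Category labels follow the five classes of r given in the text.
LABELS = ["r < 0.25", "0.25 <= r < 0.5", "0.5 <= r < 0.75", "0.75 <= r < 1", "r >= 1"]
BOUNDS = [0.25, 0.5, 0.75, 1.0]


@dataclass
class Node:
    length: float = 0.0  # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)


def branch_ratio_percentages(root: Node) -> Dict[str, float]:
    """Percentage of internal branches falling in each category of r."""
    counts = [0] * len(LABELS)
    stack = list(root.children)  # the root itself has no subtending branch
    while stack:
        node = stack.pop()
        if node.children:  # internal branch
            longest = max(child.length for child in node.children)
            if longest > 0:
                counts[bisect_right(BOUNDS, node.length / longest)] += 1
            stack.extend(node.children)
    total = sum(counts)
    return {lab: (100.0 * c / total if total else 0.0) for lab, c in zip(LABELS, counts)}


def mean_root_to_tip(root: Node) -> float:
    """Average of the summed branch lengths from the root to every tip."""
    paths = []
    stack = [(root, 0.0)]
    while stack:
        node, dist = stack.pop()
        if not node.children:
            paths.append(dist)
        for child in node.children:
            stack.append((child, dist + child.length))
    return sum(paths) / len(paths)


if __name__ == "__main__":
    # Toy rooted tree: one short internal branch above two longer tips.
    toy = Node(children=[Node(0.01, [Node(0.05), Node(0.02)]), Node(0.03)])
    print(branch_ratio_percentages(toy))
    print(mean_root_to_tip(toy))
```

A large share of branches in the first category (r < 0.25) indicates short internal branches subtending much longer daughters, the configuration associated with long branch attraction.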

The HKY85+{Gamma} model of DNA evolution is certainly a simplification, which ignores processes affecting phylogenetic inference such as heterogeneity across lineages and across time. However, the model does introduce a level of complexity through the distribution of branch lengths that can still affect phylogenetic inference and was deemed adequate for the current study. The transition/transversion ratio (ti/tv ratio) and rate heterogeneity among sites were estimated from the DNA sequences, and the empirical base frequencies were used. These parameters were then used to estimate the branch lengths of each 18S and combined model tree based on each data partition. We used the program evolver from the PAML3.1 package (Yang, 1997) to simulate matrices of different sizes based on each model tree. For each partition and model tree of 141, 228, 357 (or 320), and 567 taxa, matrices of DNA sequence lengths 100, 500, 1000, 3000, 5000, 7000, and 10,000 nucleotides (nt) were simulated. For simulations of the three codon positions of atpB and rbcL, simulated DNA sequences of a limited number of parameter combinations were created. For each set of sequence length and data partition, 20 simulated data sets were subjected to MP or NJ analyses using PAUP*4b10 running on a 32-node IBM NetFinity cluster (dual Intel Pentium III, 1 GHz, 1 GB RAM each). MP heuristic searches were performed with 100 random addition-sequence replicates of stepwise addition followed by tree bisection and reconnection (TBR) swapping, keeping only 10 trees at each replicate. In order to investigate whether the errors in the MP searches were due to poor search strategies or failure of the optimality criterion to identify the correct tree, each MP search was accompanied by a search starting from the model tree followed by TBR swapping on all trees found. For NJ, both K2P (Kimura, 1980) and the HKY85+{Gamma} distances were used.
The percentages of correctly recovered trees were calculated with the software TreeCorrect1.2, developed by one of us (NS; http://www.tcd.ie/Botany/NS/software.html), by counting the number of nodes from the model trees that were present in the trees resulting from the simulated data sets (following Hillis, 1996). When multiple equally most parsimonious trees were obtained, a node was considered as correct only if it was found in all saved trees. The percentages were then averaged over the 20 replicates. In order to avoid giving an advantage to MP over NJ by counting polytomies as being 50% correct (Yang and Goldman, 1997), we forced any polytomies in the MP trees to be randomly dichotomized with the option "collapse = no" in PAUP*4b10.
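
The accuracy measure described above can be illustrated with a short Python sketch that counts the internal nodes (bipartitions) of a model tree that are present in every inferred tree. This is an assumption-laden illustration and not the TreeCorrect1.2 code: trees are represented as nested tuples of taxon names, all function names are hypothetical, and the recursive traversal would need to be made iterative for very deep trees.

```python
def _collect_clades(tree, clades):
    """Return the tip set below `tree`, recording the tip set of every internal node."""
    if isinstance(tree, str):
        return frozenset([tree])
    tips = frozenset().union(*(_collect_clades(child, clades) for child in tree))
    clades.append(tips)
    return tips


def bipartitions(tree):
    """Nontrivial splits of an unrooted tree, each stored as the side lacking a reference tip."""
    clades = []
    all_tips = _collect_clades(tree, clades)
    reference = min(all_tips)
    splits = set()
    for clade in clades:
        side = all_tips - clade if reference in clade else clade
        if 1 < len(side) < len(all_tips) - 1:  # ignore tips and the whole tree
            splits.add(side)
    return splits


def percent_nodes_correct(model_tree, inferred_trees):
    """Percentage of model-tree splits found in all inferred trees."""
    model_splits = bipartitions(model_tree)
    if not model_splits:
        raise ValueError("model tree has no internal splits to score")
    # A split counts only if every saved tree contains it.
    shared = set.intersection(*(bipartitions(t) for t in inferred_trees))
    return 100.0 * len(model_splits & shared) / len(model_splits)


if __name__ == "__main__":
    model = ((("A", "B"), "C"), ("D", "E"))
    estimate = ((("A", "C"), "B"), ("D", "E"))
    print(percent_nodes_correct(model, [estimate]))
```

Requiring a split to occur in every saved tree is equivalent to scoring the strict consensus of the equally most parsimonious trees.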

Simulating the Tree of 13,000 Angiosperm Genera
We aimed to simulate a "realistic" tree of all 13,000 angiosperm genera for which the 18S rDNA would be sequenced for all taxa; special attention was given to tree shape and branch lengths as follows. Data sets containing 13,000 taxa were simulated using the parameters of the HKY85+{Gamma} model derived from the 18S rDNA for 567 taxa, the most comprehensive tree for angiosperm families (18S rDNA, atpB and rbcL; Soltis et al., 2000), and the data set that gave the best results for the largest tree (see Results below). A tree having an imbalance index similar to the 567-taxa tree (0.687 ± 0.05), following Fusco and Cronk (1995), was created using a pure birth or Yule process using the software GenTree0.5b, developed by one of us (NS; http://www.tcd.ie/Botany/NS/software.html). In order to obtain a nonclocklike tree, the time of speciation from the Yule process was discarded and branch lengths were assigned by randomly drawing branch lengths from a Gamma distribution having the same shape parameter as the one estimated from the 567-taxa matrix for 18S rDNA based on the 18S model tree. Each daughter branch and its parent were further constrained to be no more than five times longer than each other. A modified version of evolver (Yang, 1997) was used to accommodate the matrix sizes and was used to create data sets containing 100, 500, 1000, 3000, 5000, 10,000, 15,000, and 30,000 nt. The MP and NJ analyses were run on the same IBM cluster using PAUP*4b10. MP analyses were performed using simple addition sequence followed by NNI swapping, without keeping multiple trees, whereas NJ used the HKY85+{Gamma} distance. The resulting trees were then analyzed using TreeCorrect1.2 as described above.
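
The general procedure described above (a pure birth topology, followed by Gamma-distributed branch lengths with a fivefold parent/daughter constraint) can be sketched in Python as follows. This is a rough illustration and not the GenTree0.5b implementation: the function names, the rejection-then-clamp strategy for enforcing the constraint, and the shape and scale values in the usage example are all assumptions, and the sketch makes no attempt to match the imbalance index of the 567-taxa tree.

```python
import random


def yule_topology(n_tips, rng):
    """Grow a rooted binary tree by repeatedly splitting a uniformly chosen tip.

    Returns (parent, tips): `parent` maps node id -> parent id (None for the root).
    """
    parent = {0: None}
    tips = [0]
    next_id = 1
    while len(tips) < n_tips:
        i = rng.randrange(len(tips))
        node = tips[i]
        parent[next_id] = node
        parent[next_id + 1] = node
        tips[i] = next_id            # the split tip is replaced by its two daughters
        tips.append(next_id + 1)
        next_id += 2
    return parent, tips


def assign_branch_lengths(parent, shape, scale, max_ratio, rng, max_tries=1000):
    """Draw Gamma branch lengths, keeping each daughter within max_ratio of its parent."""
    length = {}
    for node in sorted(parent):      # ids increase away from the root, so parents come first
        p = parent[node]
        if p is None:
            continue                 # the root has no subtending branch
        x = rng.gammavariate(shape, scale)
        if parent[p] is not None:    # daughters of the root are unconstrained
            lo, hi = length[p] / max_ratio, length[p] * max_ratio
            tries = 1
            while not (lo <= x <= hi) and tries < max_tries:
                x = rng.gammavariate(shape, scale)
                tries += 1
            x = min(max(x, lo), hi)  # assumption: clamp if rejection sampling gives up
        length[node] = x
    return length


if __name__ == "__main__":
    rng = random.Random(2005)
    parent, tips = yule_topology(13000, rng)
    # Shape and scale are placeholder values, not estimates from the paper.
    lengths = assign_branch_lengths(parent, shape=1.0, scale=0.01, max_ratio=5.0, rng=rng)
    print(len(tips), "tips,", len(lengths), "branches")
```

Discarding the waiting times of the Yule process and redrawing the branch lengths independently is what makes the resulting tree nonclocklike.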


    RESULTS
Parameters of the Models
ML estimates of the ti/tv ratio, the percentage of GC content, and rate heterogeneity among sites are shown in Table 1. Third codon positions of rbcL and atpB had the highest ti/tv ratios of all codon positions, ranging between 2.636 and 2.938, whereas second codon positions of rbcL had the lowest ti/tv ratios (0.588 to 0.598; Table 1). By contrast, second codon positions of atpB had much higher ti/tv ratios than those of rbcL. When all codon positions were considered together, the ti/tv ratios were similar for atpB, rbcL, and 18S rDNA (1.812 to 2.555; Table 1), with rbcL generally having the lowest and atpB the highest ratios. The GC content was similar between the two plastid coding genes, and averaged 58%, 42%, and 30% in first, second, and third codon positions, respectively (Table 1). The amount of GC versus AT in the 18S rDNA was close to 50% in all but the 228-taxa matrix, where the GC content was 42% (Table 1). For most data partitions, there was a decrease in heterogeneity among sites when more taxa were added, although the differences were small. The third codon positions had less heterogeneity among their sites, with Gamma distributions ranging from alpha = 0.932 to 1.382 (Table 1). First and second codon positions had more similar values, with more heterogeneity within the second codon positions. The values for 18S rDNA were similar to the first and second codon positions, whereas all three codon positions taken together had slightly lower heterogeneity.


TABLE 1. ML estimates of the parameters of the HKY85+{Gamma} model of substitution for the four sizes of trees used during the simulations (see text for details).

 
Branch Lengths and Number of Substitutions
The shape of the Gamma distribution fitted on the branch length distribution for the four tree sizes and each partition is given in Table 2. The value of the shape parameter ranged from 0.878 to 1.161 for the 18S model tree branch lengths, with larger values for 357 and 567 taxa trees (Table 2).


TABLE 2. Estimates of the alpha shape of the Gamma distribution (with a scale of 1) from the branch length distribution for the four tree sizes on all data partitions (see text for details).

 
The values for the combined model tree with branch lengths estimated on 18S rDNA alone were slightly lower (0.871 to 1.060) than for the noncombined matrices, whereas those estimated on atpB and rbcL were higher than for 18S rDNA (1.174 to 1.993; Table 2). When looking at the branch length distributions obtained from the codon positions of atpB and rbcL, first and second codon positions had lower alpha values than the third codon position (Table 2). Second codon positions for atpB and rbcL taken separately had alpha values ranging from 0.307 to 0.401, first codon position values were between 0.415 and 0.578, and third codon position values were similar to 18S rDNA, ranging between 0.903 and 1.212 (Table 2).

The mean number of substitutions per site increased with the addition of more taxa, and values for 18S rDNA were similar to the second codon positions of atpB and rbcL (Table 3). First codon positions had 1.5- to 2-fold more substitutions per site than the second codon position, whereas the increase for the third codon positions was 7- to 9-fold (Table 3).


TABLE 3. Statistics of mean number of substitutions per site for the four tree sizes based on all data partitions.

 
The ratios of the length of each internal branch to that of its longest daughter branch are shown in Table 4. The values represent percentages of ratios found in each category. The first category (r < 0.25, very heterogeneous, where the parent branches were more than four times shorter than their longest daughter) was the most frequent in each topology and character partition (Table 4). The percentages within this category varied from 38.62% for 18S rDNA on the 18S model tree containing 320 taxa to 73.18% for the second positions of atpB on the combined model tree containing 141 taxa (Table 4). A recurrent pattern was that second codon positions, for all tree sizes and for both coding genes, had ca. 70% of their ratios in this category. They were followed by the first codon positions with between 55.31% and 66.66%, whereas third codon positions had percentages similar to those of the whole atpB and rbcL genes and 18S rDNA (Table 4). The categories with the next highest percentage of ratios (towards more homogeneous parent/daughter branch lengths) were the second (0.25 ≤ r < 0.5) and fifth (r ≥ 1), whereas the third (0.5 ≤ r < 0.75) and fourth (0.75 ≤ r < 1) were the least represented categories (Table 4).


TABLE 4. Ratios of each internal branch length divided by its longest daughter branch length grouped by categories.

 
18S rDNA, atpB, and rbcL-Like Trees
The efficiencies of MP and NJ for estimating the combined model trees and 18S rDNA model tree from the simulated data sets are shown in Figure 1. First we used HKY85+{Gamma} distances for NJ searches. For the simulations based on the combined model trees (Fig. 1A), MP and NJ correctly inferred less than 85% of nodes with 10,000 nt. Starting with 100 nt, there was a steep increase in the percentages of tree branches correctly found until the sequence length reached 1000 nt. Then, a plateau was reached with each simulated data set from 1000 or 3000 nt for MP and NJ, respectively, with a slow increase in percentages afterwards. This pattern was found in all simulations performed (Fig. 1, Fig. 2, Fig. 3, Fig. 4, Fig. 5 and Fig. 6, see below). The largest tree with 567 taxa proved more difficult to recover for both methods than the smaller 141-taxa tree with any sequence length, whereas the lowest percentages were found with the 320-taxa 18S rDNA tree, from which less than 75% and 65% of the nodes were correctly recovered with 10,000 nt for MP and NJ, respectively (Fig. 1A). A different result was found when the model tree was created based on the 18S data sets exclusively (Fig. 1B). With 10,000 nt, MP correctly inferred 99% of the nodes from the 141-taxa tree, with 97% reached with 3000 nt (Fig. 1B). Slightly lower values were obtained with the 228-taxa tree, a result similar to Hillis's (1996) analyses of the same matrix (Fig. 1B). The two larger trees with 320 and 567 taxa obtained similar percentages that were lower than the two smaller trees (93% with 10,000 nt; Fig. 1B). The results obtained with NJ followed the same pattern as MP, with a decrease in the percentages of tree correct as the size of the tree increased, except that the percentages were slightly lower (Fig. 1B).


FIGURE 1. Efficiency of MP and NJ (HKY85+{Gamma} distance) for estimating the different model trees from the simulated data sets based on 18S rDNA parameters. Results represent the average over 20 replicates of simulations, with vertical bars indicating standard deviation around the mean. (A) Combined trees based on combined analyses of atpB, rbcL, and 18S rDNA (matrices with 141 and 567 taxa) and on combined analysis of atpB and rbcL (matrices with 357 taxa). (B) 18S trees based on analyses of 18S rDNA for 141, 228, 320, and 567 taxa.

 

FIGURE 2. Efficiency of MP and NJ (HKY85+{Gamma} distance) for estimating the combined model tree from the simulated data sets from 141, 357, and 567 taxa. Results represent the average over 20 replicates of simulations, with vertical bars indicating standard deviation around the mean. (A) Simulations based on atpB parameters. (B) Simulations based on rbcL parameters.

 

FIGURE 3. Efficiency of NJ algorithm using K2P distances for estimating the different model trees from the simulated data sets. Results represent the average over 20 replicates of simulations, with vertical bars indicating standard deviation around the mean. (A) Simulations based on 18S rDNA parameters estimated on the combined model tree. (B) Simulations based on 18S rDNA parameters estimated on the 18S model tree. (C) Simulations based on atpB parameters estimated on the combined model tree. (D) Simulations based on rbcL parameters estimated on the combined model tree.

 

FIGURE 4. Efficiency of NJ (HKY85+{Gamma} distance) for estimating the different model trees from the simulated data sets based on the three codon positions of atpB and rbcL. Results represent the average over 20 replicates of simulations, with vertical bars indicating standard deviation around the mean. (A) First codon position, (B) second codon position, and (C) third codon position.

 

FIGURE 5. Efficiency of MP for estimating the different model trees from the simulated data sets based on the three codon positions of atpB and rbcL. Results represent the average over 20 replicates of simulations, with vertical bars indicating standard deviation around the mean. (A) First codon position, (B) second codon position, and (C) third codon position.

 

FIGURE 6. Efficiency of MP and NJ (HKY85+{Gamma} distance) for estimating the model tree from the simulated data sets containing 13,000 taxa. Simulations using NJ could not be run to completion for matrices containing more than 10,000 nt due to computational problems (see text).

 
The percentages of nodes correctly inferred by MP and NJ with simulated data sets based on atpB and rbcL parameters also quickly reached a plateau with only a slow increase in accuracy as the sequence length increased from 3000 nt (Fig. 2A and B). Simulations based on atpB resulted in MP and NJ recovering 90% of the nodes from the 357- and 567-taxa tree with 10,000 nt, but only 82% of the nodes from the smaller 141-taxa tree (Fig. 2A). Results obtained with simulations based on rbcL (Fig. 2B) were similar for both methods with the smaller 141-taxa tree being more difficult to recover. This was particularly true for the simulations analyzed with NJ that showed large standard deviations (Fig. 2A), especially for atpB.

Starting the MP search from the model trees did not substantially increase the percentage of tree correct (data not shown). This suggests that the errors in the tree reconstruction procedure arise mainly from a failure of the parsimony criterion to identify the correct tree rather than from inadequate searches through the tree space. For example, no search starting from the model trees was able to infer 100% of the nodes correctly, and even with very few characters available during the search, the initial search strategy was able to find trees similar to those found when starting from the model trees.

The simulated data sets were also analyzed by NJ using the simpler K2P model for generating distances (Fig. 3), therefore not taking into account heterogeneity of rate among sites and base frequencies, as used by Hillis (1996). Although the relative pattern found remained similar to NJ using HKY85+{Gamma} distance for each DNA sequence or model trees analyzed, the percentages of nodes correctly inferred were consistently lower. The differences in percentages of nodes correctly recovered between the two distances used ranged from a few percent with 18S rDNA simulations based on 18S model trees to little more than 10% with 18S rDNA simulations based on combined model trees (Fig. 1 and Fig. 2 versus Fig. 3).
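For readers unfamiliar with the simpler distance, the sketch below computes the Kimura two-parameter (K2P) distance between two aligned sequences using the standard formula d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. The function name and the toy sequences are hypothetical; this is an illustration of the formula, not the implementation used in the simulations.

```python
import math

PURINES = {"A", "G"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Hypothetical example: one transition (A->G) in ten sites.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

Unlike the HKY85+{Gamma} distance, this formula assumes equal base frequencies and a single rate across sites, which is the simplification referred to above.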

Simulations of Codon Positions
Simulations and subsequent phylogenetic reconstructions based on the three codon positions of atpB and rbcL reached a plateau in a similar manner as in previous simulations (Figs. 4 and 5). With MP and NJ, and for the different tree sizes, simulated data sets based on the second codon positions performed relatively poorly, with only 40% of the nodes correctly inferred for atpB and 43% for rbcL even with 10,000 nt (Figs. 4B and 5B). Simulations based on the first codon positions performed slightly better, reaching between 50% and 55% of trees correct for atpB and between 50% and 64% for rbcL, again with 10,000 nt (Figs. 4A and 5A). With NJ searches based on rbcL-like data, the 141-taxa topology proved much more difficult to reconstruct correctly than the 357- and 567-taxa topologies, with only 50% of the nodes correctly inferred versus 63% and 64%, respectively (Fig. 4A). This trend between data sets with different numbers of taxa for the first codon position of rbcL is less pronounced in the MP analyses (Fig. 5A). Finally, simulations based on the third codon positions and analyzed with either NJ or MP resulted in ca. 85% of the nodes correctly inferred with 10,000 nt (Figs. 4C and 5C). However, the 141-taxa tree data sets for rbcL-like sequences were more problematic, with both MP and NJ searches performing poorly (70% to 75% of the nodes correctly recovered; Fig. 4 and Fig. 5).

Tree of 13,000 Angiosperm Genera
Although tree searches with MP (as described above) and 30,000 nt ran to completion (see below), analyses with NJ could not be performed for data matrices larger than 10,000 nt because of the time needed to calculate distance matrices from larger numbers of nt (for practical reasons, an arbitrary limit of 96 hours of CPU run time on the 32-node IBM NetFinity cluster was set for all simulations). NJ analyses for such large matrices required far more CPU time to complete than the heuristic searches performed for MP when set as described in Materials and Methods. For example, a data matrix with 100 nt took 1h20:35 and 1h15:30 of CPU time for MP and NJ, respectively. With 10,000 nt, the time spent for the searches was 3h29:55 and 89h20:23 for MP and NJ, respectively, whereas 30,000 nt were analyzed in 5h19:41 by MP. We also found that MP searches outperformed NJ with more than 1000 nt (Fig. 6). With 10,000 nt, NJ correctly inferred 63% of the nodes, whereas MP correctly inferred 80%. The percentage of tree correct continued to increase steadily with the addition of more characters, reaching 82% with 30,000 nt when analyzed with MP. The number of characters that would be required to correctly infer 100% of the nodes from this 13,000-taxa tree (assuming the increase is monotonic) was estimated from a logarithmic model fitted to the points obtained with MP (y = 8.461 × ln(x); r = 0.984), leading to a total of 135,800 nt, approximately the size of many plastid (chloroplast) genomes.
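The extrapolation can be checked with a short calculation. The sketch below assumes that the fitted model has no intercept and that the logarithm is natural, an assumption supported by the fact that it reproduces the 135,800 nt figure quoted above; the function names are hypothetical.

```python
import math

SLOPE = 8.461  # coefficient of the logarithmic model reported in the text


def predicted_percent(nt):
    """Percentage of nodes predicted to be correct for a sequence length of nt."""
    return SLOPE * math.log(nt)  # assumption: natural logarithm, no intercept


def nt_for_percent(target_percent):
    """Sequence length at which the fitted curve reaches target_percent."""
    return math.exp(target_percent / SLOPE)


# Sequence length needed for the fitted curve to reach 100% of nodes correct.
print(round(nt_for_percent(100.0), -2))
```

Inverting the model at y = 100 gives a value within rounding of the 135,800 nt reported. The fitted curve is an extrapolation well beyond the simulated range, so the figure should be read as an order of magnitude rather than a precise requirement.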


    DISCUSSION AND CONCLUSION
 
Tree Sizes
The efficiency of two different phylogenetic methods to reconstruct four large angiosperm trees was assessed in this study using computer simulations. Both MP and NJ performed relatively well under the conditions of the simulations despite the large number of possible trees existing for such large matrices. The sizes of the matrices rendered ML searches impractical for the purpose of this study, but recent approaches (e.g., Guindon and Gascuel, 2003) allow the ML criterion and Bayesian approaches (Huelsenbeck and Ronquist, 2001) to become a feasible and useful strategy even for large phylogenetic analyses.

The results obtained with these simulations are in agreement with Hillis (1996), suggesting that the 228-taxa tree for 18S rDNA is not an exception and that several other large angiosperm trees can similarly be reconstructed by MP and NJ using different DNA sequences. These results contrast with a recent study by Huelsenbeck and Lander (2003), who showed that MP was often inconsistent even under a simple model of cladogenesis used to simulate the trees. A method is consistent if, and only if, the entire tree is always inferred correctly (Yang and Goldman, 1997), something we are not measuring here. However, the percentage of tree correct can give an idea of how accurate the method is. When a plateau of 95% of the nodes correct is reached with increasing number of characters, the method is perhaps not consistent, but it infers most of the tree correctly under the conditions investigated, a property worth knowing in itself. Huelsenbeck and Lander (2003) also used ultrametric trees and a very different measure of accuracy. Although the tree sizes in the two studies are completely different (Huelsenbeck and Lander [2003] simulated up to eight species only), their results raise the question of whether the procedure used to simulate our data is somewhat biased. Indeed, Hillis (1996) used a model tree that was the MP tree for the data. This can introduce a favorable bias for MP by placing long branches adjacent to each other and transforming any "Felsenstein zone" situation into a "Farris zone" (Rannala et al., 1998). The same criticism can be made of the study presented here. However, with the exception of the 18S trees, our model trees came from the analyses of the three regions combined together and not from a direct analysis of the DNA region investigated. This creates discrepancies, albeit small, between the optimal MP topology for each gene and the one used as model tree. The difference can readily be seen in Figure 1, where the same model of evolution is used but the model tree is different. When the 18S tree is used, the percentages of tree correct rise above 95% for all tree sizes (Fig. 1B). But when the combined tree, which is not optimal for the 18S rDNA sequences, is used instead, the percentages reach only between 70% and 85%. A limited number of simulations were done on random model trees for all tree sizes investigated here using the 18S rDNA model parameters and 5000 nt (data not shown), and the percentages decreased further to around 65% to 75%. Simulations of both the topologies with a birth/death process and the DNA sequences, for trees comparable in size to those presented here and therefore avoiding the potential bias for MP, have been done by Bininda-Emonds et al. (2000). Although they used the consensus-fork index as their measure of accuracy, results comparable to ours were obtained. Nevertheless, it is important to keep in mind that our simulations probably represent a best-case scenario for such large reconstructions. It would be important to assess the impact of additional factors such as mass extinction, violation of the molecular clock, and more complicated models of substitutions on the consistency of the methods of reconstruction (Bininda-Emonds et al., 2000; Huelsenbeck and Lander, 2003).

In all our simulations, the pattern of results was similar, with a steep increase in the success of MP and NJ in reconstructing the model trees when the number of characters increased from 100 to 1000 or 3000 nt, followed by a plateau where the rise in percentages was minor with subsequent addition of characters (Figs. 1 to 6). It has been suggested that a ratio or value exists below which the number of characters is too low to allow tree reconstruction methods to discriminate between different tree topologies (Erdös et al., 1997; Kim, 1998). The simulations performed here indicate that sequence lengths of less than 1000 nt (for the plastid and nuclear genes studied) do not contain enough information to allow MP or NJ to successfully reconstruct the model trees used. Clearly, different data sets of identical length can contain varying amounts of informative characters, but the point made here is a general one. The simulated sequences represent "perfect" data sets that contain almost no constant or uninformative characters and are based on data sets that have not reached saturation and are not severely affected by functional constraints (but see Savolainen et al., 2002, for more details). With real data sets, the appropriate length of sequence will vary according to the DNA region and the organisms studied. However, the plateau reached with each simulation can be seen as the proportion of nodes easily recovered by the different methods, leaving a percentage that could be considered as problematic or even unrecoverable. With less than 5000 nt, the deeper nodes of the trees were more difficult to reconstruct, whereas most of the nodes closer to the tips of the trees required less than 1000 nt in order to be correctly inferred. A similar conclusion was reached by Bininda-Emonds et al. (2000), who showed that shallow nodes were reconstructed with 80% accuracy with as few as 500 characters, whereas deeper nodes were more difficult to infer, requiring far more characters. This trend was observed regardless of the number of taxa present in the trees (Bininda-Emonds et al., 2000). A decrease in accuracy in both methods can be associated with a relative increase in the number of heterogeneous branch lengths as measured by our ratios of parent/daughter branches (Table 4). Our simulations are also consistent with Hillis et al. (2003), and show that, past a certain point, the addition of characters will not have a major impact on the percentage of nodes correctly recovered by the tree reconstruction method. In many empirical cases, it should be possible to predict, using computer simulations and a given model of evolution, the number of nucleotides beyond which improvement of tree accuracy becomes unlikely. Given the ease of generating simulated data sets and the computer power now available, such analysis can easily be performed, and the resulting prediction would greatly aid the design of empirical studies.

Simulations based on the 18S rDNA suggested that smaller trees containing 141 or 228 taxa were more efficiently reconstructed by the two methods used than larger trees with 357 and 567 taxa. The difference was, however, small, and more than 90% of the nodes were correctly inferred in the two larger trees (Fig. 1B). This difference could indicate that by adding more taxa, the problem of selecting the optimum solution from the tree space is getting harder. This is, however, not always the case, and the smaller 141-taxa tree proved to be much more difficult to reconstruct for NJ and MP than the two larger ones with simulations based on atpB and rbcL (Fig. 2). Adding more taxa to a phylogenetic analysis can be seen as a strategy to reduce the impact of long-branch attraction, thus avoiding the pitfalls of the "Felsenstein zone" (e.g., Purvis and Quicke, 1997). However, in order to be of any use, the additional taxa have to be judiciously selected in order to intercept long branches (Graybeal, 1998). The four angiosperm trees used represent an increasing sample of the flowering plant families, but it is unclear whether the conclusions reached by Graybeal (1998) can be applied to the phylogenetic trees investigated here. The 567-taxa tree represents a more comprehensive picture of angiosperm evolutionary history and provides a more complete framework for testing evolutionary hypotheses. Although caution should be taken with the generalization of our results, this larger tree was neither more nor less difficult to reconstruct than the smaller 141-taxa tree (Fig. 1). Adding taxa might not always be the best solution (Yang and Goldman, 1997), especially when the sampling is heavily biased towards specific clades. But having the more complete range of taxa is certainly an appreciable goal for any evolutionary study, something that should not be dismissed based solely on the complexity of the search through the tree space.

The ratios of parent/daughter branch lengths measured on the three codon positions of atpB and rbcL (Table 4) suggested that the information content present in the first and second codon positions is distributed unevenly across the trees, whereas this trend is less pronounced for the third codon positions, which leads to a higher proportion of smaller ratios in the first two codon positions. Such a situation is likely to pose problems to phylogenetic reconstruction either by creating ambiguities near nodes surrounded by very small branch lengths, or by allowing cases of long branch attraction in regions of the trees where the ratios are very small. Indeed, the simulations showed such a trend (Figs. 4 and 5), and third codon positions were found to be much more able to correctly infer any of our model trees. A similar conclusion was reached by Källersjö et al. (1999). This higher success of third codon positions could be an artifact resulting from the larger proportion of nodes being close to the tips of the tree.

Branch Lengths
The estimated shape of the distribution of branch lengths for each model tree used in the simulations (Table 2) did not show large changes between the different sizes of trees for each data partition. Although larger trees tended to have a larger shape value, which indicates more similar branch lengths, this remained the case whether model trees were based on 18S rDNA or atpB and rbcL (Table 2). An important aspect not taken into account when estimating the global distribution of branch lengths is whether adjacent branches are of similar lengths or not. The ratios of parent/daughter branch lengths attempted to identify whether parts of the model tree were formed by a succession of heterogeneous branch lengths. Better indices would have taken into account the complete quartet surrounding each internal branch. The ratio used here has the advantage of being simple, and because it is measured over all branches of the rooted model tree, each branch of the triplet is taken into account in turn. It can therefore give some, albeit limited, insight into the distribution of branch lengths through the tree. The ratios of parent/daughter branch lengths calculated on each model tree indicate that the model tree for atpB with 141 taxa had a higher proportion of small internal branches giving birth to a daughter branch of at least four times its length than any other model tree considered (56.83%, 49.71%, and 48.04%, for 141-, 357-, and 567-taxa trees, respectively; Table 4). The difference between the 141-taxa model tree and the two other trees was accentuated when the internal branches with daughter branches of at least twice their length were considered (75.73%, 64.11%, and 63.99% for 141-, 357-, and 567-taxa trees, respectively; Table 4). However, the number of internal branches grows when the number of taxa increases, and so should the probability of inconsistently estimating an internal branch (Kim, 1996). This seems to be in contradiction with the results found here for atpB. The problem of heterogeneous branch length was first investigated by Felsenstein (1978) in an unrooted four-taxa tree containing two long and three short branches. Kim (1996) expanded the case to trees with large numbers of taxa. He also considered quartets, but the terminal taxa of Felsenstein (1978) were replaced by subtrees containing any number of terminal taxa. In such a configuration, he showed that the length of the five branches defining the quartet was not of primary importance, but that the total length of the subtree is the determinant factor. This is due to the difficulty of accurately reconstructing the ancestral state at the base of each subtree when the total length increases. Cases where MP is inconsistent were identified even when all the branches of the true tree are of equal length (Kim, 1996). He also showed that for each value of P (defined as the length of the two attracted long branches at opposite ends of the internal branch), there was a critical tree size beyond which the model tree became inconsistent (Kim, 1996). For binary characters, if P is less than a critical value of 0.125, approximately, the model tree is consistently estimated regardless of the number of taxa (Kim, 1996). Both factors, the subtree lengths and the lengths of the immediate branches, are therefore important for the method to consistently reconstruct an internal branch.
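As an illustration of the parent/daughter ratio discussed above, the following sketch computes the proportion of internal branches that give rise to at least one daughter branch a given number of times longer. It is one possible reading of the index (the exact definition is given in Materials and Methods); the tree encoding, the function name, and the branch lengths are hypothetical.

```python
def proportion_heterogeneous(tree, factor):
    """Proportion of internal (non-root) branches with a daughter branch
    at least `factor` times longer than the branch itself.

    A node is encoded as (branch_length, children); leaves have children == [].
    """
    total = 0
    flagged = 0

    def visit(node, is_root):
        nonlocal total, flagged
        length, children = node
        if children and not is_root:
            total += 1
            if any(child[0] >= factor * length for child in children):
                flagged += 1
        for child in children:
            visit(child, False)

    visit(tree, True)
    return flagged / total if total else 0.0


# Hypothetical rooted tree with two internal branches (lengths 0.01 and 0.03).
example = (0.0, [
    (0.01, [(0.05, []), (0.02, [])]),
    (0.03, [(0.04, []), (0.07, [])]),
])
print(proportion_heterogeneous(example, 4))  # daughters at least four times longer
print(proportion_heterogeneous(example, 2))  # daughters at least twice as long
```

In this toy tree only the 0.01 branch has a daughter at least four times its length, whereas both internal branches have a daughter at least twice as long, mirroring the two thresholds reported in Table 4.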

Towards Recovering the Angiosperm ToL
Almost all families of angiosperms were sampled in the 567-taxa matrix, and the step currently being taken by systematists is to obtain sequences for all ca. 13,000 genera of angiosperms. We can therefore ask if it is realistic to build an accurate tree for those taxa with only a few thousand nucleotides sequenced per taxon. Likewise, even building an incomplete ToL will still require the sampling of an immense number of taxa if inconsistencies caused by incomplete taxon sampling are to be avoided (Gauthier et al., 1988; Donoghue et al., 1989; Farris et al., 1996). A large phylogenetic tree for the land plants containing 2538 taxa (Källersjö et al., 1998) has been built using rbcL and parsimony jackknifing (Farris et al., 1996), demonstrating the feasibility of such attempts; however, although it is always possible to build a tree, how to assess its closeness to the "true" tree is unclear. Bininda-Emonds et al. (2000), in their simulations, achieved an accuracy of 80% on an 8192-taxa tree using less than 5000 nt with a substitution rate set to 0.1. A 50-fold increase in the rates of substitution reduced the number of characters required to reach the same level of accuracy to less than 2000 nt. The simulations performed here with 13,000 taxa indicate a similar trend (Fig. 6). Although the branch lengths assigned to the 13,000-taxa topology in the simulations are probably not representative of real biological branch lengths (except for the initial distribution used), MP correctly inferred 82% of the nodes with a sequence length of 15,000 nt, viz. roughly the equivalent of 10 gene sequences the length of rbcL (Fig. 6). The rate of evolution was based on the 18S rDNA, a slow-evolving gene. This could explain the three-fold increase in number of characters needed in our simulations in comparison to Bininda-Emonds et al. (2000). The heuristic search option used was crude, and it is likely that more thorough strategies would increase the percentages of tree correct to the extent that the search can be performed in a reasonable time (with current computer power, TBR swapping was tried without swapping on multiple trees, but the search did not run to completion after 190 hours of CPU time on the 32-node NetFinity cluster). The extrapolation obtained from the percentages of tree correctly inferred by MP suggested that between 100,000 and 150,000 nt could be required for such a large tree to be correctly reconstructed at every node. Such a sequencing effort should be readily achievable with application of genomic approaches, and the tree reconstruction procedure would require more efficient and faster computational algorithms (e.g., Yang and Rannala, 1997; Ronquist, 1998; Larget and Simon, 1999; Quicke et al., 2001; Vos, 2003; Guindon and Gascuel, 2003).

In our analyses, NJ was performed on distance matrices obtained from a model of evolution that matched the one used to simulate the data sets, therefore taking into account the potential multiple hits that would impair MP searches (Felsenstein, 1978; Swofford et al., 1996). However, MP was found to give slightly better tree correct percentages than NJ in most circumstances. One potential problem affecting NJ was demonstrated by Strimmer and von Haeseler (1996), who showed that accuracy and, more importantly for these simulations, average similarity between a model tree and the NJ tree decrease with an increase in number of taxa. With an increasing number of taxa, the pairwise distances will have a larger variance. Such large variance will then have an effect on the accuracy of the distance estimates used to choose the pair of taxa to be agglomerated during the next steps of the NJ algorithm. Using a method such as BioNJ or "balanced" Minimum Evolution (Gascuel, 1997; Desper and Gascuel, 2002), which will minimize the variance of the distance matrix, is likely to reduce this problem. Although the simulations done on matrices up to 567 taxa were performed much more quickly with NJ than with the heuristic search used for MP, this was not the case with the 13,000-taxa tree. The NJ algorithm computes a tree in a time that is a function of the number of taxa present in the distance matrix; therefore, the problem encountered here comes from the time spent in computing the distance matrices in the NJ analysis (the same will apply to any distance-based method or parsimony methods that incorporate a complex weighting scheme). It is noteworthy, in addition to the much longer computational time required, that computer memory was the major restriction for the distance method. For a 13,000-taxa matrix, this amounts to calculating almost 84.5 million distances and storing them before starting any tree-building procedures. Given that a single-precision floating-point number is represented by 32 bits in most computer systems, approximately 350 MB of RAM will be required to store the distance matrix alone if it is stored internally by the software. Adding to this the space required for the tree structure and other internal book-keeping done by the phylogeny program itself, the operating system, and other programs resident in memory, such searches could not be performed on current desktop computers such as an eMac G4 with 800 MB of RAM or a PC running SuSE Linux on an Intel Pentium III processor with 512 MB of RAM. The IBM 32-node cluster was the only solution to perform these searches. This was the case whether HKY85+{Gamma} or K2P distances were selected. On the other hand, MP searches were still possible on the eMac G4. Such limitations due to computer capacities are, of course, likely to diminish in the near future.
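The memory estimate above follows from simple arithmetic, sketched below under the assumption that only the n(n - 1)/2 distinct pairwise distances are stored, each as a 4-byte single-precision value. The function names are hypothetical.

```python
def pairwise_count(n_taxa):
    """Number of distinct pairwise distances among n_taxa sequences."""
    return n_taxa * (n_taxa - 1) // 2


def matrix_bytes(n_taxa, bytes_per_value=4):
    """Bytes needed to hold one triangle of the distance matrix."""
    return pairwise_count(n_taxa) * bytes_per_value


n = 13_000
print(pairwise_count(n))               # distinct distances to compute
print(round(matrix_bytes(n) / 1e6))    # storage in megabytes (10**6 bytes)
```

For 13,000 taxa this gives 84,493,500 distances and about 338 megabytes, in line with the figures of almost 84.5 million distances and roughly 350 megabytes quoted in the text. A program that stores the full square matrix, or uses double precision, would need two to four times as much.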

We conclude that simple heuristic searches on data sets containing as many as 13,000 taxa can give relatively good results and correctly infer more than two thirds of the nodes from model trees with data sets greater than 10,000 informative characters. Even before computer power increases further and before new, more efficient DNA sequencing techniques and tree search algorithms become widely used, it would be realistic to start building a complete tree of angiosperm genera in the near future. Indeed, if simple tree searches with a few gene sequences can recover as much as ca. 80% of the tree, then our simulations support the recent views that it is worth funding phylogenetic activities of this scale within the ToL framework. A comprehensive section of the ToL for angiosperm genera is certainly feasible within the next decade.


    ACKNOWLEDGMENTS
 
We thank Scott Edwards, Olaf Bininda-Emonds, Douglas Soltis, Mark Holder, and Chris Simon for their comments on the manuscript and Joe Felsenstein and Mary Kuhner for helpful discussion. We thank the Society of Systematic Biologists for awarding the Ernst Mayr prize to NS for his presentation of the results of this paper at the 2003 SSB/SSE conference. This work was funded by the Irish Higher Education Authority and an Enterprise Ireland Basic Research Grant (EI-SC/2003/437) (TRH), the Roche Research Foundation (VS), and a University of Lausanne research grant (NS).


    REFERENCES
 

    Adkins R. M., Walton A. H., Honeycutt R. L. Higher-level systematics of rodents and divergence time estimates based on two congruent nuclear genes. Mol. Phylogenet. Evol. (2003) 26:409–420.

    Baldauf S. L. The deep roots of eukaryotes. Science (2003) 300:1703–1706.

    Bininda-Emonds O. R. P., Brady S. G., Kim J., Sanderson M. J. Scaling of accuracy in extremely large phylogenetic trees. In: Pacific Symposium on Biocomputing 6—Altman R. B., Dunker A. K., Hunter L., Lauderdale K., Klein T. E., eds. (2000) River Edge, New Jersey: World Scientific Publishing Company. Pages 547–558.

    Bininda-Emonds O. R. P., Gittleman J. L., Steel M. A. The (Super)Tree of Life: Procedures, problems, and prospects. Annu. Rev. Ecol. Syst. (2002) 33:265–289.

    Chase M. W., Cox A. V. Gene sequences, collaboration, and analysis of large data sets. Austr. Syst. Bot. (1998) 11:215–229.

    Desper R., Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum evolution principle. J. Comp. Biol. (2002) 9:687–705.

    Donoghue M. J., Doyle J. J., Gauthier J., Kluge A. G., Rowe T. The importance of fossils in phylogeny reconstruction. Annu. Rev. Ecol. Syst. (1989) 20:431–460.

    Eisen J. A., Fraser C. M. Phylogenomics: Intersection of evolution and genomics. Science (2003) 300:1706–1707.

    Erdös P. L., Steel M. A., Székely L. A., Warnow T. J. Local quartet splits of a binary tree infer all quartet splits via one dyadic inference rule. Comput. Artif. Intell. (1997) 16:217–227.

    Farris J. S., Albert V. A., Källersjö M., Lipscomb D., Kluge A. G. Parsimony jackknifing outperforms neighbor-joining. Cladistics (1996) 12:99–124.

    Felsenstein J. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. (1978) 27:401–410.

    Fusco G., Cronk Q. C. B. A new method for evaluating the shape of large phylogenies. J. Theor. Biol. (1995) 175:235–243.

    Gascuel O. BioNJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. (1997) 14:685–695.

    Gaut B. S., Lewis P. O. Success of maximum-likelihood phylogeny inference in the 4-taxon case. Mol. Biol. Evol. (1995) 12:152–162.

    Gauthier J., Kluge A. G., Rowe T. Amniote phylogeny and the importance of fossils. Cladistics (1988) 4:105–209.

    Graybeal A. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. (1998) 47:9–17.

    Guindon S., Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. (2003) 52:696–704.

    Hasegawa M., Kishino H., Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 21:160–174.

    Hendy M. D., Penny D. A framework for the quantitative study of evolutionary trees. Syst. Zool. (1989) 38:297–309.

    Hillis D. M. Inferring complex phylogenies. Nature (1996) 383:130–131.

    Hillis D. M. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. (1998) 47:1–8.

    Hillis D. M., Huelsenbeck J. P., Cunningham C. W. Application and accuracy of molecular phylogenies. Science (1994) 264:671–677.

    Hillis D. M., Pollock D. D., McGuire J. A., Zwickl D. J. Is sparse taxon sampling a problem for phylogenetic inference? Syst. Biol. (2003) 52:124–126.

    Huelsenbeck J. P. Performance of phylogenetic methods in simulation. Syst. Biol. (1995) 44:17–48.

    Huelsenbeck J. P., Lander K. M. Frequent inconsistency of parsimony under a simple model of cladogenesis. Syst. Biol. (2003) 52:641–648.

    Huelsenbeck J. P., Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics (2001) 17:754–755.

    Källersjö M., Albert V. A., Farris J. S. Homoplasy increases phylogenetic structure. Cladistics (1999) 15:91–93.

    Källersjö M., Farris J. S., Chase M. W., Bremer B., Fay M. F., Humphries C. J., Petersen G., Seberg O., Bremer K. Simultaneous parsimony jackknife analysis of 2538 rbcL DNA sequences reveals support for major clades of green plants, land plants, seed plants and flowering plants. Plant Syst. Evol. (1998) 213:259–287.

    Kim J. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing numbers of taxa. Syst. Biol. (1996) 45:363–374.

    Kim J. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. (1998) 47:43–60.

    Kimura M. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. (1980) 16:111–120.

    Larget B., Simon D. L. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. (1999) 16:750–759.

    Mabberley D. J. The plant-book: A portable dictionary of the vascular plants. (1993) 2nd edition. Cambridge: Cambridge University Press.

    Mace G. M., Gittleman J. L., Purvis A. Preserving the Tree of Life. Science (2003) 300:1707–1709.

    Miadlikowska J., Lutzoni F. Phylogenetic revision of the genus Peltigera (lichen-forming Ascomycota) based on morphological, chemical and large subunit nuclear ribosomal DNA data. Int. J. Plant Sci. (2000) 161:925–958.

    Miya M., Nishida M. Major patterns of actinopterygian phylogenies: A new perspective based on > 200 complete mitochondrial DNA sequences. Integr. Comp. Biol. (2002) 42:1280.

    Omilian A. R., Taylor D. J. Rate acceleration and long-branch attraction in a conserved gene of cryptic Daphniid (Crustaceae) species. Mol. Biol. Evol. (2001) 18:2201–2212.[Abstract/Free Full Text]

    Purvis A., Quicke D. L. J. Building phylogenies: Are the big easy? Trends Ecol. Evol. (1997) 12:49–50.[CrossRef]

    Qiu Y.-L., Lee J., Bernasconi-Quadroni F., Soltis D. E., Soltis P. S., Zanis M., Chen Z., Savolainen V., Chase M. W. Phylogeny of basal angiosperms: Analysis of five genes from three genomes. Int. J. Plant Sci. (2000) 161:S3–S27.[CrossRef][Web of Science]

    Quicke D. L. J., Taylor J., Purvis A. Changing the landscape: A new strategy for estimating large phylogenies. Syst. Biol. (2001) 50:60–66.[Abstract/Free Full Text]

    Rannala B., Huelsenbeck J. P., Yang Z. H., Nielsen R. Taxon sampling and the accuracy of large phylogenies. Syst. Biol. (1998) 47:702–710.[Free Full Text]

    Rannala B., Yang Z. H. Probability distribution of molecular evolutionary trees: A new methods of phylogenetic inference. J. Mol. Evol. (1996) 43:304–311.[Web of Science][Medline]

    Ronquist F. Fast Fitch—Parsimony algorithms for large data sets. Cladistics (1998) 14:387–400.[CrossRef][Web of Science]

    Salamin N., Chase M. W., Hodkinson T. R., Savolainen V. Assessing internal support with large phylogenetic DNA matrices. Mol. Phylogenet. Evol. (2003) 27:528–539.[CrossRef][Web of Science][Medline]

    Salamin N., Hodkinson T. R., Savolainen V. Building supertrees: An empirical assessment using the grass family (Poaceae). Syst. Biol. (2002) 51:112–126.[Web of Science]

    Sanderson M. J., Driskell A. C. The challenge of constructing large phylogenetic trees. Trends Plant Sci. (2003) 8:374–379.

    Savolainen V., Chase M. W., Morton C. M., Hoot S. B., Soltis D. E., Bayer C., Fay M. F., deBruijn A., Sullivan S., Qiu Y.-L. Phylogenetics of flowering plants based upon a combined analysis of plastid atpB and rbcL gene sequences. Syst. Biol. (2000a) 49:306–362.

    Savolainen V., Chase M. W., Salamin N., Soltis D. E., Soltis P. S., Lopez A. J., Fedrigo O., Naylor G. J. P. Phylogeny reconstruction and functional constraints in organellar genomes: Plastid atpB and rbcL sequences versus animal mitochondrion. Syst. Biol. (2002) 51:638–647.

    Savolainen V., Fay M. F., Albach D. C., Backlund A., van der Bank M., Cameron K. M., Johnson S. A., Lledó M. D., Pintaud J.-C., Powell M., Sheahan M. C., Soltis D. E., Soltis P. S., Weston P., Whitten W. M., Wurdack K. J., Chase M. W. Phylogeny of the eudicots: A nearly complete familial analysis based on rbcL gene sequences. Kew Bull. (2000b) 55:257–309.

    Semple C., Steel M. A supertree method for rooted trees. Discrete Appl. Math. (2000) 105:147–158.

    Soltis D. E., Soltis P. S., Chase M. W., Mort M. E., Albach D. C., Zanis M., Savolainen V., Hahn W. H., Hoot S. B., Fay M. F., Axtell M., Swensen S. M., Prince L. M., Kress W. J., Nixon K. C., Farris J. S. Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Bot. J. Linnean Soc. (2000) 133:381–461.

    Soltis D. E., Soltis P. S., Nickrent D. L., Johnson L. A., Hahn W. J., Hoot S. B., Sweere J. A., Kuzoff R. K., Kron K. A., Chase M. W., Swensen S. M., Zimmer E. A., Chaw S. M., Gillespie L. J., Kress W. J., Sytsma K. J. Angiosperm phylogeny inferred from 18S ribosomal DNA sequences. Ann. Mo. Bot. Gard. (1997) 84:1–49.

    Steel M. Sufficient conditions for two tree reconstruction techniques to succeed on sufficiently long sequences. SIAM J. Discrete Math. (2001) 14:36–48.

    Stork N. E. Measuring global diversity and its decline. In: Biodiversity II: Understanding and protecting our biological resources—Reaka-Kudla M., Wilson D. E., Wilson E. O., eds. (1997) Washington, DC: Joseph Henry Press. Pages 41–68.

    Strimmer K., von Haeseler A. Accuracy of neighbor joining for n-taxon trees. Syst. Biol. (1996) 45:516–523.

    Sullivan J., Swofford D. L. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mammal. Evol. (1997) 4:77–86.

    Swofford D. L. PAUP*4. Phylogenetic analysis using parsimony (*and other methods). (2000) Sunderland, Massachusetts: Sinauer Associates.

    Swofford D. L., Olsen G. K., Waddell P. J., Hillis D. M. Phylogeny reconstruction. In: Molecular systematics—Hillis D. M., Moritz C., Mable B. K., eds. (1996) Sunderland, Massachusetts: Sinauer Associates. Pages 407–514.

    Telford M. J., Lockyer A. E., Cartwright-Finch C., Littlewood D. T. J. Combined large and small subunit ribosomal RNA phylogenies support a basal position of the acoelomorph flatworms. Proc. R. Soc. Lond. B (2003) 270:1077–1083.

    Vos R. A. Accelerated likelihood surface exploration: The likelihood ratchet. Syst. Biol. (2003) 52:368–373.

    Yang Z. H. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. (1994) 39:306–314.

    Yang Z. H. On the best evolutionary rate for phylogenetic analysis. Syst. Biol. (1998) 47:125–133.

    Yang Z. H., Goldman N. Are big trees indeed easy? Trends Ecol. Evol. (1997) 12:357.

    Zanis M. J., Soltis P. S., Qiu Y. L., Zimmer E. A., Soltis D. E. Phylogenetic analyses and perianth evolution in basal angiosperms. Ann. Mo. Bot. Gard. (2003) 90:129–150.





