© 2006 Society of Systematic Biologists
SDM: A Fast Distance-Based Approach for (Super)Tree Building in Phylogenomics
Edited by Olaf Bininda-Emonds: Associate Editor
1 Groupe Phylogénie Moléculaire, ISEM, Université Montpellier 2, CC 064 34095, Montpellier Cedex 05, France
2 Equipe Méthodes et Algorithmes pour la Bioinformatique, LIRMM (CNRS, Université Montpellier 2) 161 rue Ada, 34392, Montpellier Cedex 05, France E-mail: gascuel{at}lirmm.fr (O.G.)
| Abstract |
|---|
|
|
|---|
Phylogenomic studies aim to build phylogenies from large sets of homologous genes. Such "genome-sized" data require fast methods, because of the typically large numbers of taxa examined. In this framework, distance-based methods are useful for exploratory studies and building a starting tree to be refined by a more powerful maximum likelihood (ML) approach. However, estimating evolutionary distances directly from concatenated genes gives poor topological signal as genes evolve at different rates. We propose a novel method, named super distance matrix (SDM), which follows the same line as average consensus supertree (ACS; Lapointe and Cucumel, 1997) and combines the evolutionary distances obtained from each gene into a single distance supermatrix to be analyzed using a standard distance-based algorithm. SDM deforms the source matrices, without modifying their topological message, to bring them as close as possible to each other; these deformed matrices are then averaged to obtain the distance supermatrix. We show that this problem is equivalent to the minimization of a least-squares criterion subject to linear constraints. This problem has a unique solution which is obtained by resolving a linear system. As this system is sparse, its practical resolution requires O(na ka) time, where n is the number of taxa, k the number of matrices, and a < 2, which allows the distance supermatrix to be quickly obtained. Several uses of SDM are proposed, from fast exploratory studies to more accurate approaches requiring heavier computing time. Using simulations, we show that SDM is a relevant alternative to the standard matrix representation with parsimony (MRP) method, notably when the taxa sets of the different genes have low overlap. We also show that SDM can be used to build an excellent starting tree for an ML approach, which both reduces the computing time and increases the topogical accuracy. We use SDM to analyze the data set of Gatesy et al. (2002, Syst. Biol. 51: 652–664) that involves 48 genes of 75 placental mammals. The results indicate that these genes have strong rate heterogeneity and confirm the simulation conclusions.
Keywords: Distance method; evolutionary distances; MRP; phylogenomics; supermatrix; supertree; total evidence
Received September 30, 2005; Revised December 10, 2005; Accepted April 12, 2006
Phylogenomics, whereby phylogenies are built from large sets of genes, is currently a popular trend that benefits from the increased quantity of sequenced genes within a huge variety of organisms (Daubin et al., 2002; Gatesy et al., 2002; Eisen and Fraser, 2003; Driskell et al., 2004; Philippe et al., 2004, 2005; Devulder et al., 2005). One of the main difficulties in phylogenomics is that fast methods are required to process the large collections of taxa and genes. Missing data are another difficulty with such data sets, as some genes or species are less represented in databases. Numerous approaches have been proposed to deal with this problem (Bininda-Emonds, 2004); they can be classified into three main categories (Schmidt, 2003; Chap. 7):
- The low-level (or total evidence) methods concatenate all genes to obtain a single alignment, also called supermatrix of characters, which is then analyzed using standard phylogeny reconstruction algorithms. As some genes are missing for some taxa, supermatrices usually contain numerous missing characters (e.g., > 90% in Driskell et al., 2004). The various phylogenetic methods used to analyze such supermatrices are more or less vulnerable to missing characters, but the probabilistic ones seem to be not much affected and still provide accurate trees with sparse data (Philippe et al., 2004). Genes evolve under different constraints, and heterogeneity of rates and of evolutionary modes can also be problematic (Yang, 1996; Pupko et al., 2002). Again, probabilistic methods (e.g., MrBayes, Huelsenbeck and Ronquist, 2001) provide ways to circumvent this difficulty, by allowing for different substitution models to be defined among genes (or among codon positions). However, computing time is a main issue, specially with most sophisticated (e.g., Bayesian) approaches.
- The high-level methods arrange in a single tree topological information contained in the set of phylogenies inferred from each gene. Those source phylogenies are inferred independently, possibly using different evolutionary models, and may as well be derived from other (e.g., morphological or transposon-based) data types, which total evidence methods hardly account for. As some genes are missing for some taxa, the different phylogenies are defined on partially overlapping sets of taxa. This generalization of the consensus tree (Bryant, 2001) is called the supertree problem (Bininda-Emonds, 2004). Matrix representation with parsimony (MRP) (Baum, 1992; Ragan, 1992) is the most popular method to deal with this problem. MRP involves coding the topological information of every source tree in a single matrix of partial binary characters, which is then analyzed using parsimony to infer the supertree. This approach has been refined in various ways, such as weighted MRP (Ronquist, 1996) and matrix representation with flipping (Eulenstein et al., 2004). Numerous other combinatorial approaches have been proposed to deal with the supertree problem (Bininda-Emonds, 2004), including the MinCut (Semple and Steel, 2000) and modified MinCut MinCut (Page, 2002) algorithms.
- The medium-level methods involve an intermediary gene analysis stage, between simple gene concatenation and complete tree inference. Numerous solutions do exist to extract information from every single gene, without inferring the complete tree as in high-level methods. The main idea is to extract in a fast way elementary pieces of information from each gene independently, then to combine all these elements for all genes together. The hard combination task is thus performed just once with all genes being accounted for. As the first analysis stage is performed independently for each gene (or information source), these methods offer simple ways to accomodate for genes evolving under different evolutionary constraints or to combine heteregeneous data types. A good example is the quartet approach (Strimmer and von Haeseler, 1996; Schmidt et al., 2002; Piaggio-Talice et al., 2004) whereby every quartet topology is inferred using maximum likelihood from each gene before combining them in a single tree. We shall see in this paper a second example where first analysis stage involves computing for each gene the evolutionary distance between every taxon pair.
The outline of such large categories is blurred and some methods can be seen as intermediary. For example, the (medium-level) quartet approach has been proposed by several authors (Strimmer and von Haeseler, 1996; Schmidt et al., 2002) to deal with the (high-level) supertree problem, and the divide-and-conquer searching methods (e.g., Huson et al., 1999) use a (high-level) tree combination approach to solve the low-level problem. Moreover, the criterion that the method seeks to optimize gives another important point of view. Most practical methods are based on maximum parsimony (MP) and maximum likelihood (ML). However, one aim of phylogenomics is to build large phylogenies from large gene collections. Therefore, it is essential to be able to process huge data sets by low time-consuming methods. The distance-based approach is the first choice from this standpoint. Using fast algorithms such as NJ (Saitou and Nei, 1987; Studier and Keppler, 1988), BIONJ (Gascuel, 1997), or FASTME(Desper and Gascuel, 2002), trees with thousands of taxa can be inferred in a few minutes on a standard computer. Moreover, these algorithms are fairly accurate, though not as accurate as likelihood-based approaches. This computational efficiency is why distance-based methods are frequently employed in exploratory studies. They are also used to provide starting trees for procedures aimed at optimizing more time-consuming criteria. The PhyML program (Guindon and Gascuel, 2003) is a good example of this approach with respect to the ML criterion.
Paradoxically, few distance-based approaches have been proposed in phylogenomics. One simple method is to directly estimate pairwise evolutionary distances from the concatenated matrix of characters. For example, Paup*'s (Swofford, 2002) option MissDist = Ignore only takes sites that have no missing value in the two sequences into account. This procedure is named distance-based total evidence (DTE) in the following and is obviously limited by large amounts of missing data and severe rate heterogeneity. A second method, the average consensus supertree (ACS) procedure, was proposed by Lapointe and Cucumel (1997) to deal with the supertree problem, where there can be large amounts of missing data. The first step is to compute the path-length distance matrices corresponding to the source trees. Each source matrix is then standardized, and ACS computes the average of the standardized matrices to produce the distance supermatrix that is analyzed using a least-squares method. ACS has been shown to be the same as MRP in the consensus setting with unitary branch lengths, but both are different in the more general supertree context (Lapointe et al., 2003). A similar averaging method was used by Lapointe et al. (1999) and Levasseur and Lapointe (2001) to compare and combine various distance matrices being obtained directly (the medium-level way) from sequences or from DNA hybridization, or corresponding to (high-level) gene trees. The standardization step proposed by Lapointe and Cucumel (1997) involves dividing all distances in each matrix by the maximum distance in that matrix. Other standardization methods have been explored, but they seem to be inaccurate with more than two trees and Lapointe and Levasseur (2004) concluded that "other ways of scaling path-length distance matrices need to be investigated when combining more than two trees of varying size" (p. 100). Recently, Creevey and McInerney (2005) proposed another distance-based method to the supertree problem, named most similar supertree (MSS). The unitary (every branch has length 1) path-length distance matrices corresponding to the source trees are first computed; then, MSS searches for the supertree that best represents these matrices using topological rearrangements.
Here, we propose a novel distance method, which follows the same line as ACS but is based on a much more involved standardization procedure that answers the limitations outlined by Lapointe and Levasseur (2004). This method, named super distance matrix (SDM), first deforms the source matrices without modifying their topological message, so as to bring them as close as possible to each other; then, just as with ACS, so-deformed matrices are averaged and analyzed by usual tree-building algorithms. Simulations show that SDM deals efficiently and accurately with collections containing a large number of source matrices of varying size. SDM was initially designed as a medium-level method (source distance matrices are directly computed from the sequences of each gene), but it is an effective alternative within high-level scenarios (source matrices are obtained from the gene trees, just as with ACS) and within low-level scenarios, where good starting trees are obtained thanks to its speed.
In the following, we first describe the principle and the main features of the SDM algorithm; we then provide comparisons of SDM to other gene combination techniques using simulations; we further compare SDM to other approaches using a phylogenomics data set of placental mammals (Gatesy et al., 2002). Mathematical proofs and equations are provided in the Appendix.
| The SDM Method |
|---|
|
|
|---|
Notations and Definitions
Let
ij1), (
ij2), ..., (
ijp), ..., (
ijk) } be a collection of k distance matrices (with no missing entries), where
ijp is the evolutionary distance between taxa i and j for the gene p.
ijp); np is the size of
p
|
|
ijp) closer (in the least-squares sense) relative to the others.
Method
Assume a high-level context and let Tp be the tree corresponding to gene p. (
ijp) is then additive and is obtained from Tp by computing the path-length for every ij pair. (
ijp) is equivalent to Tp as Tp can be unambigously recovered from (
ijp). It is well known (Barthélemy and Guénoche, 1991) that multiplication by a factor
p > 0 to obtain a new distance matrix (
p
ijp) does not change the topology of Tp. This operation is equivalent to multiplying every branch length of Tp by
p. Similarly, it is easily shown that the addition of a constant aip
0 to each of the 2 (np – 1) nondiagonal distances corresponding to taxon i does not change the topology of Tp. This operation is equivalent to elongating, by length aip, the external branch corresponding to taxon i. This addition can be performed independently for every taxon, and both multiplication and addition operations can be combined to obtain the new matrix (
p
ijp + aip + ajp) that contains the same topological information as (
ijp).
In the medium-level context, distance matrices are estimated from sequences (or other data) and are not additive (i.e., do not exactly correspond to a tree). But the above property still holds in some sense. Indeed, it is easily shown (Gascuel, 1994) that NJ and a number of related algorithms infer the same topology from the original distance matrix (
ijp) and from the deformed one (
p
ijp + aip+ ajp). This is still true when using any of the algorithms implemented in the FastME package (Desper and Gascuel, 2002). Moreover, simulation experiments show that least-squares algorithms, e.g., Fitch from the Phylip package (Felsenstein, 1993) and MW (Makarenkov and Leclerc, 1999) from the Trex program (Makarenkov, 2001), are almost insensitive to multiplication and addition operations.
The different ACS standardization methods (Lapointe and Levasseur, 2004) all correspond to the use of the multiplication operation to rescale matrices before averaging them. SDM uses both multiplication and addition, which greatly increases flexibility as multiplication involves one free parameter per source matrix, whereas addition involves one parameter per taxon for each source matrix. The basic principle of SDM is to deform each of the k distance matrices (
ijp) by multiplying them by a positive factor
p and adding constants aip in order to bring them as close as possible to each other in the least-squares sense. All distances that are shared by at least two matrices of
(i.e., such that kij
2) are taken into account in the computation of deformation parameters. Moreover, weights (wp) are associated to each of the source matrices to give them a confidence value (see below for more). Thus, for every pair ij such that kij
2, we aim at minimizing the variance term:
|
| (1) |
|
| (2) |
p and aip parameters have been computed. By minimizing criterion (1), we bring closer the
pij distances, and we obtain a reliable estimation of their average that defines the SDM superdistance.
As said above, wp weights allow to give a confidence value to each of the source matrices. Typically, the variance of any distance estimate is inversely proportional to the length of the sequences used for estimation (Nei and Jin, 1989). Thus, it is coherent to set wp to be equal to the sequence length, which is denoted as
p. On the other hand, matrices involving few taxa might have a poor influence in comparison to matrices with many taxa. To compensate for this effect, we can use
p/[ñp (ñp – 1)], or the intermediate value
p/ñp as the matrix weight. A number of other weightings are possible, and SDM is easily extended to the case where each distance is associated to a confidence value, just as in weighted least-squares tree building methods (Fitch and Margoliash, 1967).
With the minimization of Vij being applied for every relevant pair of taxa ij, SDM thus involves minimizing the sum:
|
| (3) |
Several linear constraints on the variables are associated with minimization of criterion (3). The
p factors deal with the various evolutionary rates of each gene in a similar way as the gene-specific rate models first suggested by Yang (1996). Constraint (4), identical to that of the proportional model of Pupko et al. (2002), forces the
p factors to be equal on average to 1.0:
|
| (4) |
p = 0,
p.
External branches of a phylogeny are generally longer than the internal branches. Consequently, most of the variance of each pairwise distance is generally supported by the two external branches. Moreover, Lapointe and Levasseur (2004) noticed that high heterogeneity in the branch lengths of source trees deteriorates ACS topological accuracy. The aip variables thus try to normalize the external branch lengths in the various matrices. Constraint (2) forces, for each taxon i, the sum of aip to be equal to 0 and forbids overelongation (or shortening) of external branch lengths corresponding to taxon i:
|
| (5) |
ijp), the sum of aip to be equal to zero:
|
| (6) |
p values. The topological signal of the original matrix would then be stifled, as was experimentally observed (in the absence of constraint (6)) with matrices having few taxa and low (or contradictory) signal. Note that constraint
i
Minimization of criterion (3) involves calculating its partial derivatives for each of the k +
pñp variables
p and aip, adding Lagrange multipliers (Luenberger, 1984: Chap. 10) that are associated with each of the ñ+ k linear constraints. We thus obtain a linear system defined by O(ñk) variables and equations which has a unique solution (see Appendix). Resolving this system has O(ñ3k3) time complexity, which is theoretically equivalent to the running time of the NJ algorithm with a distance matrix of size ñk. However, as this linear system is very sparse, the practical time required to solve it is much lower (see below) using an appropriate library (MTJ, available at https://mtj.dev.java.net/+, is used in our implementation).
Let
p* and aip* be the optimal values of the parameters we obtain this way, then the SDM distance supermatrix (
ijSDM) is defined by:
|
|
1), not only to those used to compute the optimal parameter values (i.e. kij
2). This last step of the SDM approach is fast and requires O(n2k) running time. To check the SDM practical running time, we generated 100 collections of distance matrices with k = 4, 8, 12, 16 and n from 50 to 500 and then measured the running time t of SDM. Assuming t = b(nk)a, we performed a linear regression on log (t) as a function of log (nk) and found that the estimated value of a is below 2.0. Thus, in practice, SDM requires computing time that is at most quadratic in nk. For example, it only takes a few minutes to deal with a collection of distance matrices with n = 500 and k = 20, using a 1.8 GHz Pentium IV PC with 1 Gb RAM.
Phylogenetic Reconstruction Using SDM
A distance-based algorithm is applied to the SDM distance supermatrix to obtain a phylogeny. However, just as with ACS, missing entries may occur in this distance supermatrix depending on the extent of taxon overlap within the source matrices. It has been shown (Farach et al., 1995) that tree reconstruction from distance matrices with missing entries is computationally hard, and heuristics approaches have to be used. Two types of method have been proposed:
- The indirect method involves first estimating missing distances by applying an ultrametric (De Soete, 1984), additive (Landry et al., 1996), decomposition-based (Lapointe and Landry, 2001), or quartet-based (Guénoche and Grandcolas, 1999) completion algorithm. The Trex package (Makarenkov, 2001) provides several implementations of such algorithms to be used before tree building using any standard method with the completed matrix.
- The direct method involves using a weighted least-squares algorithm and associating missing distances with null weight, which means that missing distances are simply discarded from weighted least-squares computations (Swofford et al., 1996: 449). The MWmodif algorithm (Makarenkov and Leclerc, 1999) from Trex and the Fitch algorithm (Felsenstein, 1997) from the Phylip package (Felsenstein, 1993) implement this technique.
A combination of both direct and indirect methods is provided by MW* (Makarenkov and Lapointe, 2004) (also available in Trex); this algorithm first applies an ultrametric or additive completion algorithm (depending on the density of missing distances) and then infers a tree using the weighted least-squares algorithm MW (Makarenkov and Leclerc, 1999), where weights are set at 1.0 for known distances, 0.5 for estimated distances, and 0.0 for missing distances (if any remain).
However, missing distances are relatively rare, though the amount of missing characters is usually high in the gene collections that are commonly used in phylogenomics studies. For example, data sets of Gatesy et al. (2002) and Philippe et al. (2004, 2005) have high ratio of missing character states (about 68%, 25%, 35%, respectively) but do not produce any missing distances when using SDM. Indeed, in these data sets some genes (e.g., cytochrome b and ribosomic protein L10) have been sequenced for all taxa; at least one gene distance matrix is then complete, which induces that the SDM supermatrix is also complete. Moreover, it is a simple consequence of randomness that the number of missing distances tends to decrease when the number of genes increases. For example, with the two very large data sets of Driskell et al. (2004), which were collected from Swiss-Prot and GenBank thanks to a computer program (previous collections were collected manually), the ratio of missing distances is
19% and
1%, whereas the ratio of missing characters is
92% and
87%, respectively. In the same way, in our simulations study (see below), missing distances are very rarely observed when the number of genes is above 10 and when the ratio of missing characters (equal in expectation to the taxon deletion rate) is of 25%. When the SDM distance supermatrix is complete, fast algorithms (e.g., NJ, BioNJ, or FastME) can be used to infer the corresponding tree.
| Simulation Protocol |
|---|
|
|
|---|
We conducted large-scale simulations to evaluate the topological accuracy of SDM. Our aim was to compare the ability to recover the correct topology and the running times of low-, high-, and medium-level approaches. In the three cases, we present standard methods that are compared to SDM-based scenarios. Moreover, we emphasize distance-based methods as SDM belongs to this category. We first describe the way trees and sequences were generated, then the various methods we tested, and finally the criteria we used in the comparisons.
Tree Generation
The procedure was similar to the one used in Guindon and Gascuel (2003), which can be referred to for further details. Random 48-taxon trees were generated using the standard Yule-Harding process, via the r8s program (Sanderson, 2003). This process makes the trees clocklike, so we created a deviation from this model by multiplying every branch length by (1 + X), where X followed an exponential distribution with expectation µ. The µ value represents the extent of deviation and was identical within each tree but different from tree to tree and equal to 0.2/(0.001+U), with U being uniformly drawn from [0, 1]. The smaller the U, the larger the µ and the larger the deviation from the molecular clock. Let tbl be the total branch length of the generated tree. We obtained the nonclocklike tree T with total length 1.0 by dividing every branch length by tbl. T was the "correct" tree that the various methods aimed at recovering.
To simulate the evolution of the different genes, we generated k trees Tp from T by multiplying every branch length of T by 0.4 + 8.6 Vp, where Vp was uniformly drawn from [0, 1]. The Vp value was the same within each tree Tp, but different from tree to tree. These k source trees Tp thus have the same topology as the tree T. However, they have their own evolutionary rates with relative values ranging from 1.0 to 22.5 (= (0.4 + 8.6)/0.4) in extreme cases; such values are in agreement with real values (see Guindon and Gascuel, 2003, for more and, below, our analysis of Gatesy et al. data, 2002).
Sequence Generation
We considered gene collections of size k = 2, 4, 6, ..., 20 and generated 500 data sets per k value. For each of these data sets, we first generated a correct tree T and then a collection of k gene trees Tp, as explained above. For each gene p, we uniformly drew sequence length
p between 200 and 1000 bp, and then used the Seq-Gen program (Rambaut and Grassly, 1997) to simulate the sequence evolution along Tp according to the K2P substitution model (Kimura, 1980). We used a transition/transversion ratio of 2.0 and did not rescale the Tp trees. To simulate partial overlap that occurs in real data sets, we randomly removed some of the taxa within each gene alignment obtained. Following Eulenstein et al. (2004), two overlap conditions were studied, corresponding to 25% taxon deletion per gene (strong overlap) and 75% deletion (low overlap). However, an overlap of at least four taxa was preserved between each gene pair to maintain a common evolutionary history between genes and avoid meaningless data sets. Note that the expected ratio of missing characters was also equal to 25% and 75%, respectively, due the random processes we used for sequence length generation and taxon deletion.
Inference Methods
The (10 gene collection size x 500 collections x 2 overlap conditions =) 10,000 generated datasets were used to compare a number of tree building approaches. Our aim was (1) to check the properties of SDM when used in various scenarios of low-, medium-, and high level; (2) to compare these SDM-based scenarios to classical approches (e.g., MRP), and (3) to compare SDM with other distance-based methods (e.g., ACS). All tested methods and scenarios are described below, grouped according to their combination level. Figure 1 displays a flowchart indicating the way the various scenarios combine several methods to achieve tree construction from gene collections (e.g., PhyML+MRP scenario involves first inferring gene trees using PhyML, then combining these trees using MRP).
|
Medium-level, Distance-Based Approaches.—
For each dataset, we used Paup* with K2P to estimate k distance matrices from the k sequence alignments. The SDM distance supermatrix was computed from this collection of matrices, with each matrix weighted by the length of the corresponding sequences (i.e. wp =
p) in formula (1). We then used the Fitch program (with all default options, notably without global rearrangements) to build a phylogeny from the SDM distance supermatrix that (possibly) contains missing entries. This tree-building scenario is called SDM+Fitch (Fig. 1). In order to test for the advantage of using aip variables, instead of solely using
p variables that deal with gene rate heterogeneity, we ran another similar scenario, where the aip variables were forced to be zero. This second scenario is called SDM* + Fitch; it is close to ACS as it only uses multiplication operation to rescale matrices (see also Bevan et al., 2005). We tested other approaches to deal with missing entries, as listed in the Introduction, but they all performed poorer than the Fitch weighted least-squares program (see further results and discussions regarding MW* by Makarenkov and Lapointe, 2004).
Low-Level Combination.—
For each dataset, a supermatrix of characters was obtained by concatenating the k partially deleted genes. We computed the K2P distance matrix from this supermatrix of characters using the PAUP*'s MissDist = Ignore option (see the Introduction) and then used BioNJ to infer the DTE (distance-based total evidence) phylogeny.
To obtain an ML-based total evidence phylogeny, we analyzed the supermatrix of characters using PhyML with K2P. This program searches for the optimal tree according to the ML criterion, via topological rearrangements from a starting tree. As these topological rearrangements are local and solely based on nearest neighbor interchange (NNI) (Swofford et al., 1996), the resulting tree depends, to some extent, on the starting tree. We call DTE+PhyML the scenario whereby the PhyML default option is used, which involves using DTE (see above and Fig. 1) to compute the starting tree. As we suspected that DTE would generate poor trees in this phylogenomics context, we also used SDM+Fitch (see above and Fig. 1) to infer the starting tree to be then refined by PhyML; we call this scenario SDM+Fitch+PhyML (Fig. 1). Our aim was to check that using the improved SDM starting tree, we improve the resulting PhyML tree and reduce the number of NNIs and the running time.
High-Level Combination.—
A collection of k ML phylogenies was built from the k partially deleted genes using PhyML with K2P. We then combined these trees using the standard MRP technique, which involves first coding the tree topologies in a partial binary matrix, then inferring the supertree by maximum parsimony. To achieve this task, we used TNT (Goloboff et al., 2003), which is well known for its efficiency, and followed the standard approach (Bininda-Emonds and Bryant, 1998) that defines the MRP supertree as the strict consensus of the most parsimonious trees. TNT was run with 25 random addition sequences, TBR branch swapping and ratchet default option. We call this supertree construction scenario PhyML+MRP (Fig. 1).
We also tested three distance-based supertree approaches, using the same PhyML source trees as with MRP. First, we evaluated ACS regarding its pioneer role in the field. Gene trees were transformed into path-length matrices, and we used the standardization procedure of Lapointe and Cucumel (1997), which applies to any number of source matrices, unlike the other standardizations presented by Lapointe and Levasseur (2004). This version of ACS was combined with the recent MW* algorithm (Makarenkov and Lapointe, 2004), which invokes both indirect and direct algorithms to deal with missing distances (see above). This scenario is called PhyML+ACS97+MW* (Fig. 1). We selected MW* to be combined with ACS as it was designed by the same author group, but we also performed experiments substituting MW* by Fitch. We applied SDM to the same path-length matrices as ACS and combined it with Fitch; this scenario is called PhyML+SDM+Fitch (Fig. 1). Finally, we evaluated (Creevey and McInerney, 2005) using default parameters and recommended options: NJ was applied to the MRP matrix using p-distances to obtain a starting tree; unitary path-length matrices corresponding to each gene tree were then computed and fed into MSS, which was run with SPR rearrangements. This scenario is called PhyML+MSS (Fig. 1).
Topological Accuracy Measure
We measured the topological accuracy of every scenario using the quartet distance dq (Estabrook et al., 1985) between the inferred tree
and the model tree T. dq counts the number of resolved 4-trees (i.e., four-taxon trees) present in one tree but not in the other. dq is then the sum of two error types: the type I error corresponding to resolved 4-trees induced by
but not present in T, the type II error corresponding to resolved 4-trees in T but not induced by
. As any fully resolved tree with n taxa induces
4-trees, this measure can take any integer value between 0 and 2
. dq is then more precise than the widely used bipartition distance (Bourque, 1978; Robinson and Foulds, 1979), which counts the number of internal branches present in one tree but not in the other, and then takes integer values between 0 and 2(n–3). Moreover, dq is less sensitive to slight topological differences; e.g., when just one taxon is misplaced and far away from its correct location, the bipartition distance is high as a number of bipartitions are incorrect, whiereas the dq distance remains moderate as only quartets involving this taxon are modified. Thus, dq is better suited than bipartition distance to compare remote trees Steel and Penny, 1993, as obtained with 75% deletion rate where tree inference is hard (see below). dq was normalized by dividing its value by 2
; 0 then corresponds to identical trees, whereas a distance of 1 means that both trees do not share any 4-tree. To avoid giving a topological meaning to very short branches in the infered trees, every branch length less than 0.0001 was collapsed to make a multifurcation.
| Simulation Results |
|---|
|
|
|---|
Topological Accuracy
For each of the 20 conditions (10 gene collection sizes x 2 overlap conditions), the collected 500 dq values were averaged and are graphically represented in Figure 2. These graphs show the average dq value as a function of the number (k) of genes.
|
As expected, all curves in Figure 2 are decreasing: the correct tree T is better recovered (i.e., the dq distance between
Among pure distance-based scenarios, SDM+Fitch is best. It outperforms SDM*+Fitch in all conditions, indicating that incorporating aip variables in criterion (1) gives a significant improvement over using only the
p multiplication factors. As expected, DTE performance is very poor and its results are out of the scales used in Figure 1. This is due to weak distance estimation caused by rate heterogeneity among genes and missing sequences. Indeed, when two taxa share slow genes, they are estimated to be close, whereas when their common genes are evolving fast, they are predicted to be distant. Applying BioNJ (or any other algorithm) to such a poor distance matrix inevitably results in a poor tree.
High-level scenarios combine the source trees inferred by PhyML into a supertree. Among the three distance approaches, PhyML+SDM+Fitch is best in all conditions, and PhyML+ACS97+MW* tends to outperform PhyML+MSS when the gene number (k) is relatively high. We also tested other combinations, substituting Fitch and MW* to obtain the PhyML+ACS97+Fitch and PhyML+SDM+MW* scenarios. Neither one nor the other is better than PhyML+SDM+Fitch, and PhyML+ACS97+Fitch outperforms PhyML+ACS97+MW*. This seems to indicate that Fitch (direct method) could be better suited than MW* (combining direct and indirect algorithms) to deal with distance matrices obtained in phylogenomics studies and containing missing entries. This somewhat contradicts findings presented by Makarenkov and Lapointe (2004), but could be explained by differences in the simulation protocols. These authors used a single distance (super)matrix with random deletion of pre-fixed numbers of entries, whereas our protocol is based on the assembly of several gene distance matrices and closer to phylogenomics data. Thus, our supermatrices are likely to be more perturbed than those in Makarenkov and Lapointe (2004), which could penalize indirect methods that only use a few distances to fill each of the missing entries. Note, moreover, that Fitch slightly outperformed MW* in one of the two experiments presented by Makarenkov and Lapointe (2004).
Comparing now SDM and MRP, we see that PhyML+MRP and PhyML+SDM+Fitch show similar accuracy with 25% taxon deletion, whereas the SDM-based scenario outperforms MRP with 75% deletion. In fact, it can be seen that PhyML+SDM+Fitch deals better with missing information (e.g., k = 2 with 25% and 75% deletion), whereas PhyML+MRP performs well when information is abundant (e.g., k = 20, where the two methods are close in both deletion conditions). This could be explained by the often poor resolution of MRP supertrees, which is due to the use of strict consensus and is higher with low source tree overlap (e.g., for k = 10 and 75% deletion rate, MRP supertrees contain 17% unresolved quartets on average). However, in the bootstrap analysis context, we showed that collapsing poorly supported branches improves topological accuracy (Berry and Gascuel, 1996) by decreasing type I error without significantly augmenting type II. A better explanation (Lapointe and Cucumel, 1997) could be that SDM not only uses the topology of the source trees (as MRP) but also their branch lengths. SDM-based trees then have more information than MRP trees. This could also explain the poor results of MSS, which loses information by setting all branches to length 1. Weighted MRP (where branches of the source trees are weighted by their bootstrap support; Ronquist, 1996) performs better than standard MRP (Bininda-Emonds and Sanderson, 2001), but at the expense of huge computing times, as with this approach the initial tree building algorithm (here, PhyML) has to run a number of times (at least 100) on each of the source data sets. However, fast branch support estimates could be used to accelerate these computations (Kishino et al., 1990; Waddell et al., 2002; Anisimova and Gascuel, 2006).
Low-level scenarios analyze the supermatrix of characters using PhyML, which improves, via NNIs, a starting tree built by a fast distance method. SDM+Fitch+PhyML clearly outperforms DTE+PhyML, which is a poor method with 75% taxon deletion. This is due to the extreme weakness of DTE with phylogenomics data, but not true with single gene study or when there are no missing characters. NNIs considerably improve DTE trees (see 25% deletion, where DTE+PhyML shows similar accuracy as MRP), but NNIs are not powerful enough to obtain a satisfactory tree when starting from quasirandom trees, as is the case with 75% deletion.
We then see the advantage of using SDM within the three combination levels. The medium-level SDM+Fitch scenario is even better than standard MRP with 75% deletion, whereas low-level SDM+Fitch+PhyML is clearly the best in all conditions, among the methods we tested. Moreover, this latter scenario could likely be improved by incorporating the specific rate of every gene in likelihood computations, as proposed by Yang (1996), Pupko et al. (2002), and Bevan et al. (2005). Finally, in the high-level context (which greatly simplifies the processing of various data types and evolutionary modes), SDM offers a relevant alternative to MRP as it is nearly equivalent with low (25%) taxon deletion, but significantly better with high (75%) deletion rate. Comparing the three SDM-based scenarios, we see a clear ordering: the low-level approach is best in all conditions, the medium-level method is worst, and the high-level combination is in between. As we shall see (and not surprisingly), this ordering is inverse of that induced by the computing times, the low-level scenario being much heavier than the two other methods. Moreover, the gap between high-level and low-level scenarios is moderate, and their ordering could be inverted with complex data sets showing strong heterogeneity in the evolutionary modes.
Running Time
The average running times of the main scenarios used in the simulations are displayed in Table 1, with k = 10, 20, and 25 % and 75 % taxon deletion. We also generated additional data sets with n = 96 taxa (10 per condition), using the previously described procedure and the same k values and taxon deletion rates, and reported the running times in Table 1. Note that all these times strongly depend on implementation. For example, the weighted least-squares procedure in Paup* is clearly faster than Fitch, whereas both follow a closely related scheme. Thus, results in Table 1 illustrate the main tendencies but should not be overinterpreted.
|
We first see that SDM on its own is a fast algorithm. For example, it only requires 48 s with 96 taxa, k = 20, and 25% taxon deletion. It follows that the running times of the SDM-based scenarios mostly depend on the other components of the scenarios, which tend to be (much) slower than SDM itself. The medium-level SDM+Fitch scenario is one of the fastest methods. For example, with 96 taxa, k = 20, and 25% taxon deletion, SDM+Fitch requires 539 s. Thanks to the speed of PhyML and TNT, PhyML+MRP is also quite efficient, being slower than SDM+Fitch with 48 taxa, but generally faster with 96. However, most of the computing time required by SDM+Fitch is spent by Fitch (e.g., 491 s as compared to 539 s, in our previous example). Fitch is useful as it copes with missing entries, but, as explained earlier, real data sets often yield full supermatrices of distance. In such cases, much faster inference algorithms do exist. In our simulations, all distance supermatrices are full when k = 20, n = 48, 96, and for both taxon deletion rates. With this (k = 20) data sets, we then used FastME instead of Fitch. Topological accuracy remains similar (e.g., with 48 taxa and 25% deletion, average dq topological distances are 0.0242 for SDM+Fitch and 0.0268 for SDM+FastME) but the tree inference time is less than 1 s in all settings; e.g., with 96 taxa, k = 20, and 25% taxon deletion, the SDM+FastME scenario requires approximately 50 s, as compared to 539 s when using Fitch. In this biologically common case, SDM+FastME is the fastest inference scenario, by a factor of 10- to 100-fold with 96 taxa, and this factor increases with the number of taxa. With 500 taxa and k = 20, SDM+FastME requires a few minutes, while other scenarios require hours (or days) of computation.
Comparing high-level scenarios, we see that with n = 96 PhyML+SDM+Fitch tends to be handicapped in comparison to PhyML+MRP, due to the use of Fitch. Replacing Fitch by FastME in case of full distance supermatrix does not significantly change the topological accuracy (e.g., with k = 20, 48 taxa, and 25% deletion rate, average dq topological distances are 0.0162 for PhyML+SDM+Fitch and 0.0168 for PhyML+SDM+FastME) but makes the SDM approach faster than PhyML+MRP. The two other high-level scenarios, PhyML+ACS97+MW* and PhyML+MSS, are slower than MRP- and SDM-based scenarios. PhyML+ACS97+MW* is penalyzed by MW* that is slower than Fitch (used here without global rearrangements, contrary to Makarenkov and Lapointe, 2004). MSS appears as a slow algorithm, likely due to the combination of its complex optimality criterion and of SPR topological rearrangements.
Finally, as trees built with SDM+Fitch are close to the correct tree T, we observe a clear improvement in PhyML running time when using SDM+Fitch as starting tree instead of DTE. Thus, with 96 taxa, k = 20, and 75% taxon deletion, SDM+Fitch+PhyML runs 4,286 s, as compared to 6,329 s for DTE+PhyML, i.e., a relative gain of around 50%.
We see from these comparisons that SDM-based scenarios not only have high topological accuracy but are also efficient relative to the other approaches. Moreover, they become much faster when the distance supermatrix does not contain any missing entry, thanks to the use of a fast distance-based tree building method.
| Application |
|---|
|
|
|---|
To illustrate the properties of SDM, we analyzed a data set of placental mammals (with focus on Cetartiodactyla), which was used by Gatesy et al. (2002) in a parsimony-based low-level combination framework. This taxonomic group was recently studied using different data and high-level approaches by Mahon (2004) and Price et al. (2005). We first describe Gatesy et al.'s data set and the various tree-building scenarios we tested, then provide the results, both in terms of running time and likelihood of the inferred trees. As we shall see, these results confirm our findings with simulated data sets.
Data and Tree Building Scenarios
The original Gatesy et al. data set comprises 57 character sources: 3 morphological data sets, 5 protein sequences, 1 tranposon, 33 nuclear genes, and 15 mitochondrial genes. As the current version of PhyML does no allow for separate analysis of various data types, we only retained the DNA coding sequences. We then considered a data set of 48 genes, 36,639 sites, and 75 placental mammals, from which 7 Afrotherians were used to root the inferred trees. As shown in Gatesy et al., this gene collection has high taxonomic sampling heterogeneity, and 68% of the characters are missing. To obtain a fair comparison between Gatesy et al.'s tree-building approach and the other scenarios, we run TNT on the 48 concatenated genes, with 25 random taxon additions, TBR branch swapping, and ratchet default option. The corresponding tree is called Gatesy-TNT in the following.
All scenarios described in our simulations were also applied to this gene collection. The distance matrices were estimated using the GTR model (Rodriguez et al., 1990). To weight source matrices in SDM, we used wp =
p /[ñp (ñp – 1)] in Equation (1), which compensates for taxon number heterogeneity among genes (e.g., 10 taxa for
-lactalbumin and 75 for cytochrome b). As the cytochrome b gene is present for all of the 75 taxa, the SDM distance supermatrix does not contain any missing entry and we used FastME instead of Fitch. Likelihood computations were performed using PhyML with the GTR+
model; we used eight rate categories and the gamma distribution parameter was estimated from the data. Invariant sites were not used as their proportion was estimated to be zero in preliminary studies. Moreover, to estimate the likelihood of all the topologies from the various scenarios, we fitted branch lengths and parameters to the original supermatrix of characters, using PhyML with the same GTR+
8 substitution model.
Results
The most likely tree is built by SDM+FastME+ PhyML. This phylogeny is shown in Figure 3 and its log-likelihood is equal to –330,354. This tree is relatively close to Gatesy et al.'s original tree, which has a log-likelihood of –330,492. Quartet distance between both is of 0.028. Although Gatesy et al. found that Camelidae+Tayassuidae+Suidae were monophyletic, our tree displays a basal position of Camelidae among Cetartiodactyla and a sister-group relationship between Suina and Hippopotamidae+Cetacea+Ruminantia. Such basal position of Camelidae has already been proposed and discussed by Madsen et al. (2001) and Waddell et al. (2003) in low-level combination studies, and by Price et al. (2005) in a high-level combination framework. We also found that Pholidota+Carnivora is the nearest parent of Perissodactyla, which is another different branching relative to the topology found by Gatesy et al. As the corresponding branching has low bootstrap support in the tree of Gatesy et al., the tree in Figure 3 represents a likely alternative (biologically and mathematically). Gatesy-TNT tree is not much different from Gatesy et al.'s original tree; its log-likelihood is of –330,428 (instead of –330,492) and quartet distance between both is equal to 0.007.
|
Results of all the scenarios are summarized in Table 2. We measured (1) the log-likelihood (as explained above), (2) the running time (using a 1.8-GHz Pentium IV PC with 1 Gb RAM), and (3) the topological distance between the corresponding tree and the best (SDM+FastME+PhyML) tree. As all trees are relatively close, we used the bipartition distance instead of the quartet distance to augment the contrast (see above comparison between both measures). This distance, also called Robinson and Foulds (dRF) distance, was normalized; 0 corresponds to identical trees, whereas 1 means that both trees do not share any bipartition (clade). Finally (4), we checked for the significance of our findings using Shimodaira asymptotically unbiased test (2002), as implemented in CONSEL software.
|
The SDM+FastME+PHYML tree is significantly better than the other trees (P = 0.986). Overall, the results are in good accordance with simulations, even though the ranking criteria are not the same (likelihood versus topological distance with the model tree). Low-level methods tend to be the best ones, including Gatesy-TNT parsimony-based approach (but excluding DTE). Moreover, using SDM+FastME (instead of DTE) to build a starting tree increases the likelihood of the resulting PHYML tree. Among high-level scenarios, PHYML+MRP performs best (
70 log-likelihood units below the best tree), PhyML+SDM+FastME is also efficient (
220 log-likelihood units below the best tree), whereas PhyML+MSS and PhyML+ACS+MW* are outperformed (
1900 log-likelihood units below the best tree). SDM+FastME medium-level scenario (
1200 log-likelihood units below the best tree) is behind the best high-level methods, but performs better than PhyML+MSS and PhyML+ACS+MW*. Finally, DTE is the worst of all methods, just as in simulations (
3,000 log-likelihood units below the best tree). Topological distances (measured by dRF) between the best and the other trees are also significant; e.g., DTE and the best trees share only 41% of clades, whereas the PhyML+MRP value is of 84%. Results with quartet distance are much less contrasted as those measures become 88.5% and 98.5%, respectively. This ordering of the scenarios is very similar to that of Figure 2, with 25% taxon deletion and large number of genes. Even though the ratio of missing characters in the Gatesy et al. data set is closer to 75% than to 25%, the gene number in this data set is large (48) and some genes are sequenced for all taxa (e.g., cytochrome b); information is then abundant, which explains the closeness with 25% (rather than 75%) taxon deletion.
The SDM+FastME tree is inferred in a running time of approximately 30 s, considerably faster than any other scenario (except SDM*+FastME and DTE). This confirms the benefits of this medium-level approach at an exploratory stage, or for building a starting tree. This speed should also be useful to perform bootstrap analysis, which is hardly achievable with other scenarios. The
p values estimated by SDM range from 0.26 to 2.80, with a median value of 1.17. As the
p parameters are inversely proportional to the evolutionary rates, these results show that the data set of Gatesy et al. is composed of genes with quite heterogeneous rates. For example, 1/
p = 0.356 corresponds to the slowest gene, the nuclear ZFX, and 1/
p = 3.755 corresponds to the fastest one, the mitochondrial ATP8; the rate ratio between both is about 11.0. SDM medium-level based scenario can then be used to obtain the evolutionary rates of the studied genes in a quick way (i.e., much faster than any ML-based method). The advantage of such approach was already discussed by Bevan et al. (2005), who used it to account for gene rate heterogeneity in ML-based tree inference with very low computational cost.
| Conclusion |
|---|
|
|
|---|
We have presented a novel method, named SDM, to combine a collection of source distance matrices into a single distance supermatrix. SDM can be used in tree-building scenarios of various levels and computational costs. Using large-scale simulations and a real phylogenomics case study, we have shown that SDM, used together with Fitch or FastME tree building programs, has topological accuracy similar to that of the popular MRP method. With low taxon overlap, SDM tends to outperform MRP, notably when it is used in a high-level way to combine gene trees. Moreover, in a low-level context, SDM can be used to quickly construct a starting tree to be refined by a maximum likelihood method. According to our simulations, this latter approach seems to be the most accurate gene combination method to date. This result could be affected by strong heterogeneity in the evolutionary modes of the studied genes, which was not incorporated in our simulations but may occur with real phylogenomic data. However, likelihood-based separate analysis (e.g., MrBayes; Huelsenbeck and Ronquist, 2001) provides a way to deal with such schemes in the low-level context, and SDM-based medium and high-level scenarios should not be affected as different models can be used to estimate their input (i.e., distances and trees, respectively).
The SDM algorithm is very fast. The computing time required by the SDM approach (i.e., first running SDM, then inferring the tree from the SDM supermatrix using a distance algorithm) greatly depends on the taxon overlap among genes. When the SDM supermatrix is complete (which occurs frequently, as some genes have been sequenced for a large number of species), the SDM approach is very efficient thanks to the use of fast algorithms such as NJ, BioNJ, or FastME; in this case, huge data sets can be dealt with in a few minutes on a standard computer. When the SDM supermatrix contains missing entries, as is the case for some recent very sparse data sets selected by computer programs (Driskell et al., 2004), slower algorithms such as Fitch or MW* must be used; then the SDM approach is not as efficient with running times similar to those of MRP.
A key direction for further research is to develop fast algorithms, as fast as NJ or FastME, to accurately reconstruct trees from distance matrices with missing entries. Other directions include exploring new weighting schemes within the SDM optimality criterion (1), or new linear constraints on the parameters.
Our implementation of the SDM method, in JAVA 1.4 for better portability, is available at http://www.lirmm.fr/~criscuol/soft/sdm. All simulated data sets can be downloaded from http://www.lirmm.fr/~criscuol/soft/sdm/data sets.
| Appendix |
|---|
|
|
|---|
The goal is to minimize criterion (3), which can be written as
|
|
1,...,
p,...,
k,...,aip,...) and |
|
|
|
|
|
|
|
This is a quadratic programming problem with equality constraints. In principle, inequalities
p
0 should be added, but in practice we have never found negative
p values, neither with simulated sequences nor with biological data sets. The necessary first-order condition (equivalent to nullity of the first derivative in unconstrained monodimensional optimization) that any minimizer must satisfy is (Luenberger, 1984: 300):
|
|
, µi and
p are the Lagrange multipliers induced by linear constraints h(1), h(2)i, and h(3)p, respectively.
We have:
|
|
|
|
|
|
p ñp equations and parameters (including Lagrange multipliers), and S has at least one solution. For S to define the unique global optimum of f(v) subject to the constraints, the second-order necessary condition (Luenberger, 1984: 306) must be fulfilled (equivalent to positivity of the second derivative in unconstrained mono dimensional minimization). In our quadratic programming problem, where f(v) is non-negative, this condition becomes: |
|
2) we have
p
ijp + aip+ ajp =
p'
ijp'+ aip'+ ajp',
p, p' : {i,j}
i,j: i < j, kij
2(kij– 1) independent linear equations. Combining these equations with the k + ñ constraints, we obtain a linear system with k +
p ñp parameters (i.e., the size of v). The second-order sufficiency condition is then equivalent to testing the linear independence of the set of column vectors defining this linear system. This can easily be achieved numerically. However, except in very special cases (corresponding to equalities or redundancies, see example below), this vector set is linearly independent as soon as the number of vectors is less than, or equal to, the vector dimension, that is: |
|
|
| (7) |
The left-hand side term in (7) measures the matrix overlap. For example, in the extreme case where we only have two source matrices that only share two taxa, the three components in this term equal 1, 2, and –4, respectively, and (7) is violated; in other words, a single (1) distance comparison plus 4 (= 2+k) constraints is not enough to estimate 6 (= 4 + k) parameters. Assuming now that the two matrices share 3 taxa, the sum in (7) becomes 3+3–6 = 0, i.e., 3 comparisons are enough to estimate 8 parameters subject to 5 constraints, and S defines a unique global optimum.
Unicity of the global optimum yields the consistency of the SDM approach in estimating the relative rates of the genes. Assume that all source matrices are issued from a single (
ij) matrix through multiplication by
p factors, each representing the evolutionary rate of gene p. We have (
ijp) = (
p
ij), just as in the proportional model of Yang (1996). Moreover, without loss of generality, assume that the
ps are rescaled to obtain
p 1/
p = k. SDM then consistently estimates the
p values, as soon as condition (7) is fulfilled. Let v* be defined by
p* = 1/
p and aip = 0,
i,p. It is easily seen that f(v*) = 0 and that all the constraints are satisfied by v*. As (7) is fulfilled, v* is the unique solution of S, and
p = 1/
p* is a consistent estimator of
p, which finishes the consistency proof.
| Acknowledgments |
|---|
|
|
|---|
We thank Nicolas Galtier, Fabienne Thomarat, and Maria Anisimova for helpful comments and suggestions. We also thank Olaf Bininda-Emonds, Roderic Page, François-Joseph Lapointe and an anonymous reviewer for their help during the reviewing process. This work has been supported by ACI Informatique, Mathématique et Physique en Biologie Moléculaire (ACI IMP-Bio), and by IFR119 Biodiversité Continentale Méditerranéenne et Tropicale (Montpellier) computing facilities. This publication is the contribution number 2005-054 of the Institut des Sciences de l'Evolution de Montpellier (UMR 5554, CNRS).
| References |
|---|
|
|
|---|
-
Anisimova M., Gascuel O. Approximate likelihood ratio test for branches: A fast, accurate and powerful alternative. Syst. Biol. (2006) in press.
Barthélemy J. P., Guénoche A. Trees and proximity relations. Wiley-Interscience Series in Discrete Mathematics and Optimization (1991) Chichester: John Wiley & Sons.
Baum B. R. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon (1992) 41:3–10.[CrossRef][Web of Science]
Berry V., Gascuel O. Interpretation of bootstrap trees: Threeshold of clade selection and induced gain. Mol. Biol. Evol. (1996) 13:999–1011.
Bevan R. B., Lang B. F., Bryant D. Calculating the evolutionary rates of different genes: A fast, accurate estimator with applications to maximum likelihood phylogenetic analysis. Syst. Biol. (2005) 54:900–915.
Bininda-Emonds O. R. P., ed. Phylogenetic supertrees: Combining information to reveal the tree of life (2004) Dordrecht: Kluwer Academic Publishers.
Bininda-Emonds O. R. P., Bryant H. N. Properties of matrix representation with parsimony analyses. Syst. Biol. (1998) 47:497–508.[Web of Science][Medline]
Bininda-Emonds O. R. P., Sanderson M. J. Assessment of the accuracy of matrix representation with parsimony supertree construction. Syst. Biol. (2001) 50:565–579.
Bourque M. Arbres de Steiner et réseaux dont varie l'emplagement de certains sommets (1978) Département d'informatique et de recherche opérationnelle, Université de Montréal. PhD thesis.
Bryant D. A classification of consensus methods for phylogenetics. In: Bioconsensus—Janowitz M., Lapointe F.-J., McMorris F. R., Mirkin B., Roberts F. S., eds. (2001) Providence: DIMACS-AMS edition. 163–184.
Creevey C. J., McInerney J. O. Clann: Investigating phylogenetic information through supertree analyses. Bioinformatics (2005) 21:390–392.
Daubin V., Gouy M., Perrière G. A phylogenomic approach to bacterial phylogeny: Evidence of a core of genes sharing a common history. Genome Res. (2002) 12:1080–1090.
De Soete G. Additive tree representations of incomplete dissimilarity data. Quality Quantity (1984) 18:387–393.[Web of Science]
Desper R., Gascuel O. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comp. Biol. (2002) 19:687–705.
Devulder G., Pérouse de Montclos M., Flandrois J. P. A multigene approach to phylogenetic analysis using the genus Mycobacterium as a model. Int. J. Syst. Evol. Microbiol. (2005) 55:293–302.
Driskell A. C., Ané C., Burleigh J. G., McMahon M. M., O'Meara B. C., Sanderson M. J. Pospects for building the tree of life from large sequence databases. Science (2004) 306:1172–1174.
Eisen J. A., Fraser C. M. Phylogenomics: Intersection of evolution and genomics. Science (2003) 300:1706–1707.
Estabrook G. F., McMorris F. R., Meacham C. A. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Zool. (1985) 34:193–200.
Eulenstein O., Chen D., Burleigh J. G., Fernandez-Baca D., Sanderson M. J. Performance of flip supertree construction with a heuristic algorithm. Syst. Biol. (2004) 53:299–308.
Farach M., Kannan S., Warnow T. A robust model for finding optimal evolutionary trees. Algorithmica (1995) 13:155–179.[CrossRef][Web of Science]
Felsenstein J. Phylip: Phylogeny inference package, version 3.6b (1993) Seattle: University of Washington. Distributed by the author.
Felsenstein J. An alternating least-squares approach to inferring phylogenies. Syst. Biol. (1997) 46:101–111.
Fitch W. M., Margoliash E. Construction of phylogenetic trees. Science (1967) 155:279–284.
Gascuel O. A note on Sattath and Tversky's, Saitou and Nei's and Studier and Keppler's algorithms for inferring phylogenies from evolutionary distances. Mol. Biol. Evol. (1994) 11:961–963.[Web of Science][Medline]
Gascuel O. BioNJ: An improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. (1997) 14:685–695.[Abstract]
Gatesy J., Matthee C., DeSalle R., Hayashi C. Resolution of a supertree/supermatrix paradox. Syst. Biol. (2002) 51:652–664.
Goloboff P., Farris J., Nixon K. TNT: Tree analysis using new technology (2003) Distributed by the authors.
Guénoche A., Grandcolas S. Approximations par arbre d'une distance partielle. Math. Inf. Sci. Humaines (1999) 146:51–64.
Guindon S., Gascuel O. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. (2003) 52:696–704.
Huelsenbeck J. P. When are fossils better than extant taxa in phylogenetic analysis? Syst. Zool. (2001) 40:458–469.
Huelsenbeck J. P., Ronquist F. MrBAYES: Bayesian inference of phylogenetic trees. Bioinformatics (2001) 17:754–755.
Huson D. H., Nettles S. M., Warnow T. J. Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J. Comp. Biol. (1999) 6:369–386.[CrossRef]
Kearney M. Fragmentary taxa, missing data, and ambiguity: Mistaken assumptions and conclusions. Syst. Biol. (2002) 51:369–381.
Kimura M. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. (1980) 16:111–120.[CrossRef][Web of Science][Medline]
Kishino H., Miyata T., Hasegawa M. Maximum likelihood inference of protein phylogeny ans the origin of chloroplasts. J. Mol. Evol. (1990) 31:151–160.[CrossRef][Web of Science]
Landry P.-A., Lapointe F.-J., Kirsch J. A. W. Estimating phylogenies from lacunose distance matrices: Additive is superior to ultrametric estimation. Mol. Biol. Evol. (1996) 13:818–823.
Lapointe F.-J., Cucumel G. The average consensus procedure: Combination of weighted trees containing identical or overlapping sets of taxa. Syst. Biol. (1997) 46:306–312.
Lapointe F.-J., Kirsch J. A. W., Hutcheon J. M. Total evidence, consensus and bat phylogeny: A distance-based approach. Mol. Phylogenet. Evol. (1999) 11:55–66.[CrossRef][Web of Science][Medline]
Lapointe F.-J., Landry P.-A. A fast procedure for estimating missing distances in incomplete matrices prior to phylogenetic analysis. In: Currents computational molecular biology—El-Mabrouk N., Lengauer T., Sankoff D., eds. (2001) Montréal: Publications CRM. 189–190.
Lapointe F.-J., Levasseur C. Everything you always wanted to know about the average consensus, and more. In: Phylogenetic supertrees: Combining information to reveal the tree of life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer Academic. 87–105.
Lapointe F.-J., Wilkinson M., Bryant D. Matrix representation with parsimony or with distances: Two sides of the same coin? Syst. Biol. (2003) 52:865–868.
Levasseur C., Lapointe F.-J. War and peace in phylogenetics: A rejoinder on total evidence and consensus. Syst. Biol. (2001) 50:881–891.
Luenberger D. G. Linear and nonlinear programming (1984) London: Addison-Wesley.
Madsen O., Scally M., Douady C. J., Kao D. J., DeBry R. W., Adkins R., Amrine H. M., Stanhope M. J., de Jong W. W., Springer M. S. Parallel adaptive radiations in two major clades of placental mammals. Nature (2001) 409:610–614.[CrossRef][Medline]
Mahon S. A. A molecular supertree of the Artiodactyla. In: Phylogenetic supertrees: Combining information to reveal the tree of life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer Academic. 411–437.
Makarenkov V. Trex: Reconstructing and visualizing phylogenetic trees and reticulation networks. Bioinformatics (2001) 17:664–668.
Makarenkov V., Lapointe F.-J. A weighted least-squares approach for inferring phylogenies from incomplete distance matrices. Bioinformatics (2004) 20:2113–2121.
Makarenkov V., Leclerc B. An algorithm for the fitting of a phylogenetic tree according to a weighted least-squares criterion. J. Classif. (1999) 16:3–26.[CrossRef]
Nei M., Jin L. Variances of the average numbers of nucleotides substitutions within and between populations. Mol. Biol. Evol. (1989) 6:290–300.[Abstract]
Page R. Modified mincut supertrees. In: Lecture notes in computer science volume 2452—Guigo R., Gusfield D., eds. (2002) 537–552.
Philippe H., Lartillot N., Brinkmann H. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol. Biol. Evol. (2005) 22:1246–1253.
Philippe H., Snell E. A., Bapteste E., Lopez P., Holland P. W. H., Casane D. Phylogenomics of eukaryotes: Impact of missing data on large alignments. Mol. Biol. Evol. (2004) 21:1740–1752.
Piaggio-Talice R., Burleigh G., Eulenstein O. Quartet supertrees. In: Phylogenetic supertrees: Combining information to reveal the tree of life—Bininda-Emonds O. R. P., ed. (2004) Dordrecht: Kluwer Academic. 173–191.
Price S. A., Bininda-Emonds O. R. P., Gittleman J. L. A complete phylogeny of the whales, dolphins and even-toed hoofed mammals (Cetartiodactyla). Biol. Rev. Comb. Philos. Soc. (2005) 80:445–473.[CrossRef]
Pupko T., Huchon D., Cao Y., Okada N., Hasegawa M. Combining multiple data sets in a likelihood analysis: Which models are the best? Mol. Biol. Evol. (2002) 19:2294–2307.
Ragan M. A. Phylogenetic inference based on matrix representation of trees. Mol. Phylogenet. Evol. (1992) 1:53–58.[CrossRef][Medline]
Rambaut A., Grassly N. C. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comp. Appl. Biosi. (1997) 13:235–238.
Robinson D., Foulds L. Comparison of weighted labeled trees. Lect. Notes Math. (1979) 748:119–126.[CrossRef]
Rodriguez R., Oliver J. L., Marin A., Medina J. R. The general stochastic model of nucleotide substitution. J. Theo. Biol. (1990) 142:485–501.[Web of Science][Medline]
Ronquist F. Matrix representation of trees, redundancy, and weighting. Syst. Biol. (1996) 45:247–253.
Saitou N., Nei M. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. (1987) 4:406–425.[Abstract]
Sanderson M. J. Inferring absolute rates of molecular evolution and divergence times in the absence of molecular clock. Bioinformatics (2003) 19:301–302.
Schmidt H. A. Phylogenetic Trees from Large Datasets (2003) Düsseldorf, Germany. PhD thesis.
Schmidt H. A., Strimmer K., Vingron M., von Haeseler A. TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics (2002) 18:502–504.
Semple C., Steel M. A supertree method for rooted trees. Disc. Appl. Math. (2000) 105:147–158.[CrossRef]
Shimodaira H. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. (2002) 51:492–508.
Steel M. A., Penny D. Distribution of tree comparison metrics—Some new results. Syst. Biol. (1993) 42:126–141.
Strimmer K., von Haeseler A. Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. (1996) 13:964–969.[Web of Science]
Studier J. A., Keppler K. J. A note on the neighbor-joining method of Saitou and Nei. Mol. Biol. Evol. (1988) 5:729–731.[Web of Science][Medline]
Swofford D. L. PAUP*: Phylogenetic analysis using parsimony (*and other methods) (2002) Sunderland, Massachussetts: Sinauer Associates. version 10.
Swofford D. L., Olsen G. J., Waddell P. J., Hillis D. M. Phylogenetic inference. In: Molecular systematics—Hillis D. M., Moritz C., Mable B. K., eds. (1996) Sunderland, Massachussetts: Sinauer Associates. 407–514.
Waddell P. J., Kishino H., Ota R. Very fast algorithms for evaluating the stability of ML and Bayesian phylogenetic trees from sequence data. Genome Inform. (2002) 13:82–92.[Medline]
Waddell P. J., Shelley S. Evaluating placental inter-ordinal phylogenies with novel sequences including RAG1,
-fibriogen, ND6, and mt-tRNA, plus MCMC-driven nucleotide, amino acid, and codon models. Mol. Phylogenet. Evol. (2003) 28:197–224.[CrossRef][Web of Science][Medline]
Wiens J. J. Does adding characters with missing data increase or decrease phylogenetic accuracy? Syst. Biol. (1998) 47:625–640.
Wiens J. J., Reeder T. W. Combining data sets with different numbers of taxa for phylogenetic analysis. Syst. Biol. (1995) 44:548–558.
Yang Z. Maximum-likelihood models for combined analysis of multiple sequence data. J. Mol. Evol. (1996) 42:587–596.[CrossRef][Web of Science][Medline]
This article has been cited by other articles:
![]() |
A. M. Zelazny, J. M. Root, Y. R. Shea, R. E. Colombo, I. C. Shamputa, F. Stock, S. Conlan, S. McNulty, B. A. Brown-Elliott, R. J. Wallace Jr., et al. Cohort Study of Molecular Identification and Typing of Mycobacterium abscessus, Mycobacterium massiliense, and Mycobacterium bolletii J. Clin. Microbiol., July 1, 2009; 47(7): 1985 - 1995. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. Cheng, S. Hartmann, M. Gupta, J. G. Ibrahim, and T. J. Vision A hierarchical model for incomplete alignments in phylogenetic inference Bioinformatics, March 1, 2009; 25(5): 592 - 598. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Ranwez, V. Berry, A. Criscuolo, P.-H. Fabre, S. Guillemot, C. Scornavacca, and E. J. P. Douzery PhySIC: A Veto Supertree Method with Desirable Properties Syst Biol, October 1, 2007; 56(5): 798 - 817. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












