© 2007 Society of Systematic Biologists
Species Trees from Gene Trees: Reconstructing Bayesian Posterior Distributions of a Species Phylogeny Using Estimated Gene Tree Distributions
Edited by Thomas Buckley: Associate Editor
1 Department of Statistics, The Ohio State University Columbus, OH, 43210-1293, USA
| Abstract |
|---|
|
|
|---|
The desire to infer the evolutionary history of a group of species should be more viable now that a considerable amount of multilocus molecular data is available. However, the current molecular phylogenetic paradigm still reconstructs gene trees to represent the species tree. Further, commonly used methods of combining data, such as the concatenation method, are known to be inconsistent in some circumstances. In this paper, we propose a Bayesian hierarchical model to estimate the phylogeny of a group of species using multiple estimated gene tree distributions, such as those that arise in a Bayesian analysis of DNA sequence data. Our model employs substitution models used in traditional phylogenetics but also uses coalescent theory to explain genealogical signals from species trees to gene trees and from gene trees to sequence data, thereby forming a complete stochastic model to estimate gene trees, species trees, ancestral population sizes, and species divergence times simultaneously. Our model is founded on the assumption that gene trees, even of unlinked loci, are correlated due to being derived from a single species tree and therefore should be estimated jointly. We apply the method to two multilocus data sets of DNA sequences. The estimates of the species tree topology and divergence times appear to be robust to the prior of the population size, whereas the estimates of effective population sizes are sensitive to the prior used in the analysis. These analyses also suggest that the model is superior to the concatenation method in fitting these data sets and thus provides a more realistic assessment of the variability in the distribution of the species tree that may have produced the molecular information at hand. Future improvements of our model and algorithm should include consideration of other factors that can cause discordance of gene trees and species trees, such as horizontal transfer or gene duplication.
Keywords: Bayesian hierarchical model; coalescent theory; gene tree; MCMC; species tree
Received July 20, 2006; Revised November 22, 2006; Accepted March 6, 2007
Traditional molecular-based phylogenetic analysis consists broadly of two steps: obtaining and aligning molecular sequences and inferring gene trees for those sequences. Under this paradigm, gene trees are generally considered to be synonymous with species trees, except when forces causing discordance between gene and species trees are obvious, such as horizontal gene transfer, deep coalescence, or gene duplication (Maddison, 1997; Maddison and Knowles, 2006). Thus, in fact, molecular phylogenetic analysis really consists of three elements: molecular sequences, gene trees, and species trees. Identifying the relationships among these three elements and extracting useful information from each element are key issues for constructing an appropriate model to explain the evolutionary history of a set of species.
One discussion in the literature revolves around which should be used as the direct estimator of the species tree, sequences or gene trees? (Kluge and Wolf 1989, 1993) claimed that natural data partitions do not exist and the species tree should be estimated using the whole sequence of the genome. They proposed a combined-data approach (Kluge, 1989; Kluge and Wolf, 1993; Nixon and Carpenter, 1996) in which the sequences from all available genes are concatenated into a single sequence, along with other phylogenetic characters such as morphology or behavior. This method ignores the existence of the gene as the basic functional unit on the genome and treats nucleotides as the direct estimator of the species tree. This has drawn criticism (Slowinski and Page, 1999) because it assumes that nucleotides are independent estimates of the species tree so that the longer the sequence, the more precise the estimated species tree. It is now generally appreciated that gene trees in principle may not match the species tree irrespective of whether the gene has a long sequence or a short sequence. Indeed, recent work shows that under some combinations of branch lengths in the species tree, incongruent gene trees are more likely to occur than congruent gene trees (Kubatko and Degnan, 2006; Degnan and Rosenberg, 2006). In other words, nucleotide or amino acid data are not consistent estimators of the species tree under some circumstances of speciation. Other approaches consider gene trees as the direct estimator of the species tree (Page, 1998; Pamilo, 1988). This idea is based on Doyle's (1992) concept that nucleotides are characters of gene trees, whereas gene trees are characters of species trees (Maddison, 1997). This viewpoint suggests that using sequence data to infer species phylogeny requires two hierarchical levels of estimation: gene tree estimation and species tree estimation.
Methods for inferring gene trees from sequence data are numerous and have become extraordinarily sophisticated in recent years (Durand et al., 2006; Coop and Griffiths, 2004). However, methods for inferring species trees from gene trees are in their infancy, are not widely used, and in general suffer from numerous statistical and methodological drawbacks. For example, the concatenation method may lead to an incorrect estimate with high probability and is unable to facilitate estimation of important parameters in the evolutionary history of species such as population sizes or speciation times. Speciation times here are distinct from gene divergence times due to the coalescent process (Edwards and Beerli, 2000). The deep coalescence method (Maddison and Knowles, 2006) ignores the errors in gene tree estimation and generally assumes that gene trees are estimated with perfect certainty. In this approach, maximum likelihood (ML) or maximum parsimony (MP) trees are built for each gene and used as the true gene trees to infer the species tree. This method then underestimates the variation in the procedure for inferring a species phylogeny. The gene tree parsimony method (Page, 1998; Simmons et al., 2000) and consensus tree methods (Bull et al., 1993; de Queiroz, 1993; Rodrigo et al., 1993; Huelsenbeck et al., 1994) incorporate uncertainty of gene tree estimation by using the bootstrap gene trees (Cotton and Page, 2002) or the estimated posterior distribution of gene trees (Buckley et al., 2006). When estimating the posterior distribution of gene trees, these methods estimate the gene tree for each gene independently using an arbitrary prior distribution for each gene that involves no information about the species tree. Specifically, the joint prior distribution of gene trees G is assumed to be the product of the prior distribution for each gene tree.
|
|
Recent research has focused on a statistical model for species tree estimation. Here coalescent theory plays a central role in this model construction. Takahata applied coalescent theory and estimated the species tree for three related populations (Takahata, 1989). Degnan and Salter (2005) have derived the probability distribution for the topology of gene trees given the species tree. Slatkin and Pollack (2006) specified a statistical model for the gene genealogies of two linked loci of three species. Coalescent theory has also been applied to forming the likelihood for genetic markers such as RFLPs, SNPs, and AFLPs (Nielsen et al., 1998a, 1998b). As an alternative to deep coalescence, Arvestad considered gene duplication as the major source of conflicts between gene trees and species trees and proposed a Bayesian hierarchical model to estimate the species tree and gene trees simultaneously (Arvestad et al., 2004).
When estimating the posterior distribution of gene trees, the prior of gene trees should depend on the species tree because gene trees can be thought of as random variables generated from the species tree. When the species tree and population sizes,
, are set as unknown constants, the prior of gene trees G is here assumed to be
|
|
are taken to be random variables. To incorporate uncertainty in the species tree and
, the prior distribution of gene trees G is the sum of f(G| S,
) over all species trees and
, |
|
This suggests that a Bayesian model for inferring the species tree using sequence data should have the following features:
- It should simultaneously involve the distribution of sequences, gene trees, and the species tree.
- The underlying species tree should induce a marginal dependence in the gene trees, which should then be inferred jointly across loci.
- The model must take into account errors in the estimation of gene trees.
| Bayesian Hierarchical Model |
|---|
|
|
|---|
In the equations that follow, we use the following abbreviations: D: sequence data; G: a vector of gene trees;
: parameters in the likelihood function except the gene tree vector G; S: species trees;
: transformed effective population sizes,
= 4Ne µ.
The posterior probability of a species tree and
is given by
|
|
Our Bayesian hierarchical structure consists of modeling the following components: f(D|G,
), f(
), f(G| S,
), f(S), and f(
), each of which is explained below.
- f(D|G,
). Markovian models dominate the likelihood-based literature for both nucleotide and amino acid substitution (Felsenstein, 2004b). It is worth mentioning that, although models for nucleotide and protein sequence data are the most common, our formulation allows for any type of underlying input data where f(D|G,
) can be appropriately described. The quantity f(D |G,
) will change according to the input data and, for the same type of data, the most suitable model may be selected using a likelihood-based model selection process (Posada and Crandall, 1998) or information theory (Minin et al., 2003).
- f(
).
includes the parameters in the substitution model and all other parameters in the likelihood function except the gene tree. Naturally, the prior on
will depend on the nature of the data at hand. For example, a variety of options for the prior of
are available in the Bayesian gene tree program, MrBayes (Ronquist et al., 2005).
- f(G| S,
). The distribution of gene trees given the species tree is derived from coalescent theory. Although the procedure can allow more general models, our initial implementation uses the coalescent theory in which random mating is assumed in each population. We also assume no gene flow after species divergences and no recombination within a locus but free recombination between loci.
- f(
The branch length in a species tree represents "time" (numbers of generations), whereas it is the expected number of mutations in a gene tree. To make the two parameters compatible, we transform to
= 4µ Ne, where Ne is the effective population size and µ is the mutation rate measured as the expected number of nucleotide substitutions per site per generation.
The joint probability distribution of a gene tree topology and the m-n coalescent times tn + 1, ...,tm for a single population reduced from m to n sampled individuals along a branch of length
in a species tree was derived by Rannala and Yang (2003) to be
|
|
Thus f(G| S,
) is the product of such probabilities across all the populations. For a vector of gene trees, G, that are independent given the species tree, we multiply these conditional likelihoods in turn to find f(G| S,
). It should be noted that the species tree space is constrained because we assume that the gene split times of any two species predate their speciation time. So the cumulative node-to-tip branch lengths in the gene trees are always longer than their counterparts in the species tree.
Note also that the
may be different for different genes. For example, the mitochondrial and Y-chromosomal genes are uniparentally inherited and haploid. Thus, in the data analyses below, we assume their effective population sizes are one fourth that of autosomal markers.
- f(
). We use independent gamma distributions as the prior of
. The gamma density function is
with mean
β and variance
β2. The hyperparameters
and β can be chosen to reflect the range and likely values of
(Rannala and Yang, 2003).
- f(S). We use a birth-and-death process (Nee et al., 1994) as the prior distribution of the species tree's topology and branch lengths. Given the speciation rate (s), extinction rate (e), and the number of species (n), the joint density of the topology (T) and branch lengths (
) of a particular species tree is (Yang and Rannala, 1997):
where
and

- f(S). We use a birth-and-death process (Nee et al., 1994) as the prior distribution of the species tree's topology and branch lengths. Given the speciation rate (s), extinction rate (e), and the number of species (n), the joint density of the topology (T) and branch lengths (
Mutation rate variation among loci may influence the estimation of ancestral population sizes (Yang, 1997; Chen and Li, 2001). Nevertheless, if the ratios of rates between loci are known, we can incorporate them in the likelihood calculation (Yang, 1997). We treat these relative mutation rates among loci as parameters in our model and assume that their prior follows the uniform (0,10) under the constraint that the average ratio is 1.
| Computational Algorithm |
|---|
|
|
|---|
The entire species tree estimation procedure consists of three steps.
- Step 1 (within MrBayes): Generate vectors of gene trees from MrBayes using the approximate prior, K(G), based on a "Maximum" species tree estimate in the Hastings ratio to decide on acceptance of each vector into the Markov chain.
- Step 2: Using a second MCMC algorithm, generate species trees from the distribution compatible with the gene trees given by the approximate posterior distribution K(G| D) from step 1.
- Step 3: Use importance sampling to align the results with what would have occurred if the initial sample had been from the true prior, f(G).
- Step 2: Using a second MCMC algorithm, generate species trees from the distribution compatible with the gene trees given by the approximate posterior distribution K(G| D) from step 1.
Markov chain Monte Carlo (MCMC) is implemented to evaluate the posterior distribution of the species tree because f(D) involves an intractable integral. The posterior distribution of the species tree can be formulated as follows:
|
| (1) |
The posterior of species tree and population sizes given data,f(S,
| D), is the posterior of species tree and
given gene trees f(S,
| G) weighted by f(G| D). This motivates our algorithm to generate the posterior distribution of gene trees first and then use these gene trees to generate the posterior for the species tree.
However, in the first stage of using DNA sequences to estimate the posterior of gene trees, the prior of gene trees, f(G), is unknown. Theoretically, f(G) is equal to the integration of f(G| S) with respect to the species tree (topology and branch lengths) and population size
, namely,
|
|
, is approximated using the Monte Carlo method. This prior is used to define the Hastings ratios in MrBayes that decides whether a vector of gene trees is accepted into the Markov chain. The chain is then run to convergence, generating a sample from the approximate posterior distribution K(G| D). We save a subsample from this chain G1, G2, ..., GN along with the associated approximate priors K(G1), K(G2), ..., K(GN) to be used in steps 2 and 3. In step 2 we find the posterior distribution of the species tree given the gene tree vectors generated in step 1. Here a second MCMC algorithm is applied. For this algorithm, the birth-and-death process was used to define the prior distribution of species trees (Nee et al., 1994) and the likelihood is defined by coalescent theory via Rannala and Yang's formula. The movement strategy employed a random selection of nodes and replacement uniformly within a random band that maintained the constraints while adjusting the topology where needed.
Step 2 provides k samples from f(S| Gi) for each of the gene tree vectors G1, G2, ..., GN arising from the samples in step 1. Finally, importance sampling is applied to find the posterior distribution of species trees given the data. Note that:
|
|
|
|
, is the probability that a random species tree chosen from the birth-and-death model satisfies the constraints associated with Gi. Thus, a consistent estimate of f(Gi) is given by
|
| (2) |
|
|
|
|
|
|
We wrote a program in C language to implement the algorithm which is available at www.stat.osu.edu/~dkp/BEST. The BEST part of the algorithm goes very quickly, but the Markov chains in MrBayes must be run approximately 10 times as long as usual in order to accommodate the importance sampling. The run time is O (n2) in terms of number of taxa and O(n) in terms of number of loci.
| Data Analysis |
|---|
|
|
|---|
Australian Finch Data
We first apply the new method to a multilocus nucleotide dataset from birds recently published by Jennings and Edwards (2005). They obtained the allelic data of 30 loci (Pa-01... Pa-30) from one individual per population of Poephila acuticauda, P. hecki, P. cincta. They also included sequences from a more distant relative, the zebra finch (P. guttata), as outgroup. A total of 30 anonymous loci were developed, ranging in size from 216 to 825 bp. They performed a "four-gamete test" and the result showed that the overall incidence of intralocus recombination in the data appears very low, which supports one of the assumptions in our model that there is no intralocus recombination. Jennings and Edwards also used the assumed species tree topology previously supported by morphological and mtDNA studies and employed a multilocus coalescent approach to infer the effective population sizes and divergence times.
Posterior distributions of gene trees for 30 genes using the independent prior
The posterior distributions of gene trees are estimated in MrBayes assuming independent loci. HKY85 (Hasegawa et al., 1985) was selected as the substitution model that best fit the data according to a hierarchical likelihood ratio test. Because the position of species P. guttata is fixed as the outgroup, there are only three possible topologies, (2,(1,3)), (3,(1,2)), and (1,(2,3)). Here 1, 2, 3 represent the species Poephila acuticauda, P. hecki, P. cincta, respectively. From Table 1, there are 15 genes out of 30 whose estimates of the gene tree support the tree (3,(1,2)). The average probability for (3,(1,2)) across the 30 genes is 0.434. The corresponding probabilities for the other two possible trees are 0.205 and 0.361. Thus the tree (3,(1,2)) has slightly more support from the gene trees than (2,(1,3)) and (1,(3,2)). One way to estimate a species tree from multiple gene trees when there are three taxa is via a majority-rule criterion, whereby the gene tree whose topology is found most frequently is presumed to reflect the topology of the species tree. The majority-rule estimate of the species tree is thus the second tree, (3,(1,2)).
|
Bayesian estimation of the species tree for the concatenation method
In this case, the multilocus sequences are concatenated into a single sequence. The concatenated data was analyzed in MrBayes with an HKY85 substitution model. The prior for the topology was taken to be a uniform distribution, and branch lengths were assumed to be independently distributed exponentials. The resulting estimated topology of the species tree is in Table 1. It matches the majority-rule tree in (1) and its posterior probability is essentially one.
Bayesian estimation of gene tees, topology of species trees, effective population sizes, and divergence times using the proposed method
The finch data set was analyzed in MrBayes. The posterior distribution of gene trees was estimated with an HKY85 substitution model and the joint prior of gene trees across 30 genes. The estimated joint gene trees were then used to reconstruct the species tree using MCMC as implemented in the program Bayesian Estimation of Species Tree reconstruction program (BEST). Three different priors of effective population sizes were used to evaluate the effect of the prior on the posterior distribution. The priors are exponential distributions with means 1, 0.1, 0.0072, and 0.00072. The median of these four priors for effective population sizes (in the units of substitutions per site) are then 0.693, 0.069, 0.005, and 0.0005. They reflect the user's initial guess about the population sizes. To convert the posterior medians of these parameters to estimates of the posterior speciation times in years and effective population sizes, we assume the mutation rate is 3.6 x 10– 9, as in Jennings and Edwards (2005).
The same species tree with strong support is estimated regardless of which of the priors we use (Table 1). Our estimate of the species tree agrees with the one estimated by the concatenation method except that the support for the clade (1,2) is essentially one for the concatenation method and the support of the same clade is approximately 0.88 for the BEST method (Table 1).
To compare the BEST method with the concatenation method, we estimate the Bayes factor using the harmonic mean of the likelihood. Although the harmonic mean method can be somewhat unstable and sensitive to the lowest value of the likelihood (Lartillot and Philippe, 2005), it works here because the likelihoods for the two different methods are well separated (Fig. 1). The Bayes factor suggests that the coalescent model fits the data better than the concatenation method.
|
The posterior estimates of the divergence times are similar for the different priors (Table 2), indicating a strong signal in the data for these parameters. On the other hand, the posterior distribution of the population sizes does appear to be sensitive to the prior chosen. The estimate is strongly correlated to the median of the prior for the clade (1,2), whereas the clade (1,2,3) is relatively insensitive to the priors (Table 2).
|
The gene trees are correlated as a consequence of their joint dependence on the species tree. We should use a joint distribution to formulate the prior of gene trees from different genes. In our model, the joint distribution is derived from coalescent theory. Let Gi be the gene tree for gene i and S be the species trees. The joint distribution of gene trees is given by
|
|
) follows coalescent theory. f(G1 ... Gk) tends to put more weight on gene trees with similar topologies and branch lengths. This can be seen from the posterior probabilities of the gene trees in Table 1. There are 22 genes supporting the tree (3,(,1,2)) in Table 1, compared with only 15 genes when the independent prior was used (Table 1). The average support probability for (3,(1,2)) is 0.508, which is increased by 0.074 (17%) from the average support under the independent prior. Interestingly, we can see the pattern of the change of the posterior probability of gene trees. Consider genes Pa-4, Pa-5, and Pa-18. For genes Pa-4 and Pa-5, both posteriors under the independence model favor the third topology in Table 1. However, after adjusting for the species coalescent process, their posteriors change to favor the second topology, which is the Bayesian estimate for the majority of genes. This is because gene trees are correlated in f(G) and the topology of a particular gene tree depends on the gene trees for other genes. If a majority of genes support the same topology, it will make the rest of the genes more likely to have a similar topology. But if the support is extremely strong as for the gene Pa-18 that supports the third topology with probability near 1, its posterior may not be changed even if the joint prior is used. A similar pattern is seen with Pa-21,-16, and-30 in Table 1. Our estimate of the species tree agrees with the assumed species phylogeny found by Jennings and Edwards (2005). The posterior probability of the species tree topology is around 0.9 no matter what prior we used. The estimate of the divergence time for (1,2) using our method is similar to the estimate given by Jennings and Edwards. However, our estimate for the clade ((1,2),3) is 0.00418, which is higher than Jenning and Edwards' estimate (0.00254). For the population size, both methods have the estimate for the clade (1,2,3) around 0.005. However, the estimates of the population size of clade (1,2) are sensitive to the prior for both techniques.
Analysis of Multilocus Macaque Data
Tosi and Morales (2003) isolated total genomic DNA of 63 macaques from 19 species and eight outgroup taxa. The DNA sequences were obtained from Y-chromosomal loci, mtDNA, C4 long intron 9, and IRBP intron 3. In their analysis, the ML tree was estimated for each gene, assuming an HKY85+G substitution model. The four different gene trees were used to make inference on the pattern of the species tree, but no method was available to combine the data, and concatenating the sequences was deemed inappropriate. Divergence times were estimated only for the Y-chromosomal and mitochondrial trees.
Here we analyze a reduced data set of 19 species, including one outgroup taxa (T. gelada), for which there are data on all four "genes." We randomly chose an allele from each species. The modified data were analyzed using the proposed method to estimate the posterior of gene trees and species trees. Further, a sensitivity analysis was performed to investigate the influence of each gene on the overall estimate of species trees.
Estimate of the species tree using the BEST algorithm
The effective population sizes are parameters in the likelihood of gene trees given species trees. In theory, Y-chromosomal and mitochondrial genes are uniparentally inherited and haploid, making their effective population sizes one fourth that of autosomal markers (Tosi and Morales, 2003). This data set is a mixture of Y-chromosomal, mitochondrial genes, and autosomal genes. According to the coalescent theory, the probability that the gene tree matches the species tree depends on the ratio of branch length and the effective population size. Thus, to make the data from the four genes comparable, the 1-to-4 effective population size adjustment based on the mode of inheritance was made in our analysis. The concatenation method does not apply to this example because the genes in the data have different effective population sizes and a different mode of inheritance (Miyamoto and Fitch, 1995; Moore, 1995; Ruvolo, 1997).
Fooden (1980) defined four species groups for macaques according to distinct forms of male reproductive anatomy. The species groups include the silenus group, fascicularis group, sinica group, and arctoides group. Our estimate of the species tree identified the silenus group with relative high posterior probability (Fig. 2). The species in the other three groups are not well resolved. This indicates inadequate information in the dataset and that more genes or alleles may be needed for estimating the species tree of macaques.
|
It is interesting to compare the posterior of the gene trees with the posterior of species trees. Let D(T1,T2) be the symmetric distance between two random trees T1 and T2 (Robinson and Foulds, 1981). Table 3 provides the average ± standard deviation of the distribution of distances between the gene tree, T1, and the species tree, T2, based on the posterior distributions computed under both the independence model and the coalescent model. The distances for the independent gene model are larger than the distances for the joint coalescent-based model for all four genes. This result suggests that the joint model makes the gene trees closer to the estimated species tree than the independent gene model.
|
Comparison of coalescent-based model with the independent gene model
Our method assumes the joint estimate of the posterior of gene trees must be compatible with the species tree while common analyses assume that loci are independent. To evaluate the effect of different priors on the posterior distribution of gene trees, we want to know if the posterior distributions of gene trees using two different priors are different. To test if two distributions are different, we introduce a theorem by Maa et al. (1996).
We want to test the hypothesis H: F1 = F2, where F1 and F2 are two distributions such as the two posterior tree distributions examined here. Let X1 and X2 be independent and identically distributed random draws from F1 and independent of Y1 and Y2 from F2. Take D(..) to be any appropriately chosen distance function. The theorem posits that F1 = F2 if and only if D(X1, X2) = D(Y1, Y2) = D(X1, Y1) in distribution.
To apply the theorem, we calculated the three distances (two within group distances and the between groups distance) for each gene in Table 4. The results show that the three distances are quite different for all genes, indicating that the joint prior and the independent prior result in two significantly different posterior distributions of gene trees for this data set.
|
Sensitivity analysis
Genes may have different influence on the posterior distribution of species trees. We examined the potential gene-by-gene sensitivity of our results by eliminating each single gene from the analysis in turn and reestimating the posterior of species trees. Let S1, S2, S3, and S4 be the posterior of species trees estimated without the Y-chromosome, mtDNA, C4 intron 9, and IRBP intron 3 data, respectively. Table 5 displays the distances between each Si and S (the posterior of species trees using all four genes). The mean distances for all four genes are comparable and the credible intervals for the four distances almost entirely overlap. This suggests that the estimation of the species tree is not overly subject to the strong influence of a single outlier gene.
|
| Discussion |
|---|
|
|
|---|
Different genes may have different mechanisms of evolution and support different phylogenetic relationships. Mixing this diverse set of circumstances together through concatenation is then inappropriate (Mossel and Vigoda, 2005) and may lead to difficulties in the performance of algorithms that do not recognize the problem explicitly. This work defines the correlation between gene trees through their common species tree but then, conditional on this species tree, allows for a completely independent evolutionary processes for each gene.
The Bayesian hierarchical model we have employed adopts coalescent theory to formulate the distribution of gene trees given the species trees. Our method is aimed at the problems in which lineage sorting is the only source of discrepancies between gene trees and species trees. The user should remain cautious when using the BEST technique to analyze the data in which gene flows such as horizontal transfer or hybridization may have commonly occurred. The Bayesian hierarchical model assumes that gene split times are earlier than speciation times. Thus, after-speciation gene flow may result in shallow branches in gene trees that strongly restrict the corresponding speciation times and may mislead the species tree estimation. A simulation study has found that increasing the number of loci gives more accurate estimates of the species tree under our assumption that deep coalescence is the only reason for the conflicts between the gene tree and species tree (Maddison and Knowles, 2006). However, there are other biological factors that can affect the correspondence of species trees and gene trees. For example, horizontal transfer and gene duplication/loss may cause conflicts between gene trees and species trees. Unfortunately, it is challenging to model the underlying mechanism of horizontal transfer or gene duplication/loss without encountering problems with parameter identifiability using molecular data. In addition, incorporating these events in the model will dramatically increase the number of parameters and result in huge demand of computation. Further work should permit incorporation of these issues into diagnostics of model adequacy and into estimates of species trees provided they are rare across genes. Thus, this first version of the Bayesian hierarchical model, based on coalescent theory, is a good starting point that can easily be generalized to use more robust models of the coalescent process that are available.
We have discussed the effect of priors of the effective population size on species tree estimation. Three data sets—finch, macaque, and yeast (Edwards et al., 2007)—have been analyzed by the Bayesian hierarchical model. In the results of theses analysis, the estimate of the topology and divergence times of the species tree is reasonably robust to changes in the prior of the effective population size although naturally the estimate of the effective population size itself will be affected. The model should be used to estimate ancestral population sizes only with extreme caution. The sensitivity of the estimates of the population sizes to the prior implies that either the prior is inappropriate or the information content in the data is low or the likelihood is incorrect. Other empirical analyses suggest that ancestral population sizes are difficult to estimate under a wide variety of circumstances (Takahata, 1989; Yang, 1997; Jennings and Edwards, 2005).
In this paper, we did not exploit the use of possible prior knowledge of the species trees. We used a birth-and-death process as the prior of species trees. Further work should incorporate other priors to see the effect on the estimation of species tree and joint gene tree distributions. For example, we might use a more informed prior centered on the distribution arising from other types of data such as behavioral or morphological information or we might use a less informed prior in which all species trees are equally likely within some bounds. Of course, this choice should be left to the individual investigator, but it is important to understand the relative contribution it plays in determining the posterior compared with the information in the data. If it is desired to have the phylogeny completely resolved by the data alone, then the accumulation of more genomic data may be needed.
An important by-product of our model is the joint prior of gene trees. Our model formulates the correlation structure of gene trees across different genes through coalescent theory and the birth-death process. The correlation structure will be more realistic if our model includes key factors such as horizontal transfer or gene duplication and uses a more appropriate prior for the species tree and population sizes. Thus, further research on model formulation is necessary. But the most important thing we stress here is the novel approach to estimating gene trees by employing the joint development of gene trees that are compatible with the species tree. Most current approaches assume independent loci. It would be more reasonable to assume the loci are conditionally independent (given the species tree) but marginally dependent. Our method suggests that gene trees and species trees should be estimated simultaneously, and that likelihood-based species tree estimation requires an explicit model and corresponding algorithms not traditionally included in phylogenetic analysis.
|
| Acknowledgements |
|---|
We would like to thank Scott Edwards, Bryan Jennings, and Anthony Tosi for providing the sequences data. Special thanks to Scott Edwards for his insightful comments and suggestions for improving the model. We thank Thomas Buckley, Lacey Knowles, and Roderic Page for their constructive comments on an earlier version of this paper. This work was partially supported by grant NSF DMS-0112050 to the Mathematical Biosciences Institute.
| Notes |
|---|
|
|
|---|
2 Current address: Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, 02138, USA; E-mail: lliu{at}oeb.harvard.edu
| References |
|---|
|
|
|---|
-
Altekar G., Dwarkadas S., Huelsenbeck J. P., Ronquist F. Parallel Metropolis-coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics (2004) 20:407–415.
Arvestad L., Berglund A. C., Lagergren J., Sennblad B. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. RECOMB (2004) 326–335.
Barlow M., Hall B. G. Origin and evolution of the AmpC β-lactamases of citrobacter freundii. Antimicrob. Agents Chemother. (2002) 46:1190–1198.
Buckley T. R., Cordeiro M., Marshall D. C., Simon C. Differentiating between hypotheses of lineage sorting and introgression in New Zealand alpine cicadas (Maoricicada dugdale). Syst. Biol. (2006) 55:411–425.
Bull J. J., Huelsenbeck J. P., Cunningham C. W., Swofford D. L., Waddell P. J. Partitioning and combining data in phylogenetic analysis. Syst. Biol. (1993) 42:384–397.
Chen F. C., Li W. H. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. (2001) 68:444–456.[CrossRef][Web of Science][Medline]
Coop G., Griffiths R. C. Ancestral inference on gene trees under selection. Theor. Popul. Biol. (2004) 66:219–232.[CrossRef][Web of Science][Medline]
Cotton J. A., Page R. D. M. Going nuclear: Gene family evolution and vertebrate phylogeny reconciled. Proc. R. Soc. Lond. B. (2002) 269:1555–1561.[Medline]
Degnan J. H., Rosenberg N. A. Discordance of species trees with their most likely gene trees. PLoS Genet. (2006) 2:762–768.[Web of Science]
Degnan J. H., Salter L. Gene tree distributions under the coalescent process. Evolution (2005) 59:24–37.[Web of Science][Medline]
de Queiroz A. For consensus (sometimes). Syst Biol. (1993) 42:368–372.
Doyle J. J. Gene trees and species trees: Molecular systematics as one-character taxonomy. Syst. Bot. (1992) 17:144–163.[CrossRef]
Durand D., Halldorsson B. V., Vernot B. A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol. (2006) 13:320–335.[CrossRef][Web of Science][Medline]
Edwards S., Beerli P. Perspective: Gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution (2000) 54:1839–1854.[CrossRef][Web of Science][Medline]
Edwards S. V., Liu L., Pearl D. K. High resolution species trees without concatenation. Proc. Natl. Acad. Sci. (2007) 104:5936–5941.
Felsenstein J. Inferring phylogenies (2004a) Sunderland, Massachusetts: Sinauer Associates.
Felsenstein J. PHYLIP (Phylogeny Inference Package). Version 3.6. Distributed by the editor (2004b) Seattle: Department of Genome Sciences, University of Washington.
Fooden J. Classification and distribution of living macaques. In: The macaques: Studies in ecology, behavior, and evolution—Lindburg D. G., ed. (1980) New York: Van Nostrand Reinhold. 1–9.
Hasegawa M., Kishino H., Yano T. Dating the human-ape split by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.[CrossRef][Web of Science][Medline]
Huelsenbeck J. P., Ronquist F. MrBayes: Bayesian inference of phylogeny. Bioinformatics (2001) 17:754–755.
Huelsenbeck J. P., Swofford D. L., Cunningham C. W., Bull J. J., Waddell P. J. Is character weighting a panacea for the problem of data heterogeneity in phylogenetic analysis? Syst Biol. (1994) 43:288–291.
Jennings W. B., Edwards S. V. Speciational history of Australian grass finches (Poephila) inferred from thirty gene trees. Evolution (2005) 59:2033–2047.[Web of Science][Medline]
Kluge A. G. A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes). Syst. Zool. (1989) 38:7–25.[Abstract]
Kluge A. G., Wolf A. J. Cladistics: What's in a word. Cladistics. (1993) 9:183–199.[CrossRef][Web of Science]
Kubatko L. S., Degnan J. H. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst. Biol. (2006) 56:17–24.[CrossRef][Web of Science]
Lartillot N., Philippe H. Computing Bayes factors using thermodynamic integration. Syst. Biol. (2005) 55:195–207.[CrossRef][Web of Science]
Maa J. F., Pearl D. K., Bartoszytiski R. Reducing multidimensional two-sample data to one-dimensional interpoint distances. Ann. Stat. (1996) 24:1069–1074.[CrossRef][Web of Science]
Maddison W. P. Gene trees in species trees. Syst. Biol. (1997) 46:523–536.
Maddison W. P., Knowles L. L. Inferring phylogeny despite incomplete lineage sorting. Syst Biol. (2006) 55:21–30.
Minin V., Abdo Z., Joyce P., Sullivan J. Performance-based selection of likelihood models for phylogeny estimation. Syst. Biol. (2003) 52:674–683.
Miyamoto M. M., Fitch W. M. Testing species phylogenies and phylogenetic methods with congruence. Syst. Biol. (1995) 44:64–76.
Moore W. S. Inferring phylogenies from mtDNA variation: Mitochondrial gene trees vs. nuclear gene trees. Evolution (1995) 49:718–26.
Mossel E., Vigoda E. Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science (2005) 309:2207–2209.
Nee S., May R. M., Harvey P. H. The reconstructed evolutionary process. R. Soc. Lond. Proc. Ser. B (1994) 344:77–82.[CrossRef]
Newton M. A., Raftery A. E. Approximate Bayesian inference by the weighted likelihood bootstrap (with discussion). J. R. Stat. Soc. Ser. B (1994) 56:3–48.
Nielsen R. Maximum likelihood estimation of population divergence times and population phylogenies under the infinite sites model. Theor. Popul. Biol. (1998) 53:143–151.[CrossRef][Web of Science][Medline]
Nielsen R., Mountain J. L., Huelsenbeck J. P., Slatkin M. Maximum-likelihood estimation of population divergence times and population phylogeny in models without mutation. Evolution (1998) 52:669–677.[CrossRef][Web of Science]
Nixon K. C., Carpenter J. M. On simultaneous analysis. Cladistics (1996) 12:221–241.[CrossRef][Web of Science]
Nylander J. A. A., Ronquist F., Huelsenbeck J. P., Nieves Aldrey J. L. Bayesian phylogenetic analysis of combined data. Syst. Biol. (2004) 53:47–67.
Page R. D. M. GeneTree: Comparing gene and species phylogenies using reconciled trees. Bioinformatics (1998) 14:819–820.
Pamilo P., Nei M. Relationships between gene trees and species trees. Mol. Biol. Evol. (1988) 5:568–583.[Abstract]
Posada D., Crandall K. A. ModelTest: Testing the model of DNA substitution. Bioinformatics (1998) 14:817–818.
Rannala B., Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics (2003) 164:1645–1656.
Robinson D. F., Foulds L. R. Comparison of phylogenetic trees. Math. Biosci. (1981) 53:131–147.[CrossRef][Web of Science]
Rodrigo A. G., Kellyborges M., Bergquist P. R., Bergquist P. L. A randomization test of the null hypothesis that two cladograms are sample estimates of a parametric phylogenetic tree. N. Z. J Bot. (1993) 31:257–268.
Ronquist F., Huelsenbeck J. P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.
Ronquist F., Huelsenbeck J. P. J. P., van der Mark P. MrBayes3 manual (2005) On-line at http://mrbayes.csit.fsu.edu/mb3.1_manual.pdf.
Ruvolo M. Molecular phylogeny of the hominoids: Inferences from multiple independent DNA sequence data sets. Mol. Biol. Evol. (1997) 14:248–265.[Abstract]
Simmons M. P., Bailey C. D., Nixon K. C. Phylogeny reconstruction using duplicate genes. Mol. Biol. Evol. (2000) 17:469–473.
Slatkin M., Pollack J. L. The concordance of gene trees and species trees at two linked loci. Genetics (2006) 172:1979–1984.
Slowinski J. B., Page R. D. M. How should species phylogenies be inferred from sequence data? Syst. Biol. (1999) 48:814–825.
Takahata N. Gene genealogy in three related populations: Consistency probability between gene and population trees. Genetics (1989) 122:957–966.
Tosi A. J., Morales J. C., Melnick D. J. Paternal, maternal, and biparental molecular markers provide unique windows onto the evolutionary history of macaques monkeys. Evolution (2003) 57:1419–1435.[CrossRef][Web of Science][Medline]
Yang Z. On the estimation of ancestral population sizes. Genet. Res. (1997) 69:111–116.[CrossRef][Web of Science][Medline]
Yang Z., Rannala B. Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. Mol. Biol. Evol. (1997) 14:717–724.[Abstract]
This article has been cited by other articles:
![]() |
A. D. Leache Species Tree Discordance Traces to Phylogeographic Clade Boundaries in North American Fence Lizards (Sceloporus) Syst Biol, December 1, 2009; 58(6): 547 - 559. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Simon, S. Strauss, A. von Haeseler, and H. Hadrys A Phylogenomic Approach to Resolve the Basal Pterygote Divergence Mol. Biol. Evol., December 1, 2009; 26(12): 2719 - 2730. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. C. O'Meara New Heuristic Methods for Joint Species Delimitation and Species Tree Inference Syst Biol, November 10, 2009; (2009) syp077v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. W. Bloomquist and M. A. Suchard Unifying Vertical and Nonvertical Evolution: A Stochastic ARG-based Framework Syst Biol, November 9, 2009; (2009) syp076v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Liu, L. Yu, D. K. Pearl, and S. V. Edwards Estimating Species Phylogenies Using Coalescence Times among Sequences Syst Biol, October 1, 2009; 58(5): 468 - 477. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. E. McCormack, H. Huang, and L. L. Knowles Maximum Likelihood Estimates of Species Trees: How Accuracy of Phylogenetic Inference Depends upon the Divergence History and Sampling Design Syst Biol, October 1, 2009; 58(5): 501 - 508. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Huang and L. L. Knowles What Is the Danger of the Anomaly Zone for Empirical Phylogenetics? Syst Biol, October 1, 2009; 58(5): 527 - 536. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. S. Kubatko Identifying Hybridization Events in the Presence of Coalescence via Model Selection Syst Biol, October 1, 2009; 58(5): 478 - 488. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. A. Cranston, B. Hurwitz, D. Ware, L. Stein, and R. A. Wing Species Trees from Highly Incongruent Gene Trees in Rice Syst Biol, October 1, 2009; 58(5): 489 - 500. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Liu and S. V. Edwards Phylogenetic Analysis in the Anomaly Zone Syst Biol, August 1, 2009; 58(4): 452 - 460. [Full Text] [PDF] |
||||
![]() |
W. K. Savage and S. P. Mullen A single origin of Batesian mimicry among hybridizing populations of admiral butterflies (Limenitis arthemis) rejects an evolutionary reversion to the ancestral phenotype Proc R Soc B, July 22, 2009; 276(1667): 2557 - 2565. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Frajman, F. Eggens, and B. Oxelman Hybrid Origins and Homoploid Reticulate Evolution within Heliosperma (Sileneae, Caryophyllaceae)--A Multigene Phylogenetic Approach with Relative Dating Syst Biol, July 3, 2009; (2009) syp030v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. H. Degnan, M. DeGiorgio, D. Bryant, and N. A. Rosenberg Properties of Consensus Methods for Inferring Species Trees from Gene Trees Syst Biol, June 4, 2009; (2009) syp008v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Q. Spinks and H. B. Shaffer Conflicting Mitochondrial and Nuclear Phylogenies for the Widely Disjunct Emys (Testudines: Emydidae) Species Complex, and What They Tell Us about Biogeography and Hybridization Syst Biol, May 28, 2009; (2009) syp005v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Joly and A. Bruneau Measuring Branch Support in Species Trees Obtained by Gene Tree Parsimony Syst Biol, May 25, 2009; (2009) syp013v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. M. Bossu and T. J. Near Gene Trees Reveal Repeated Instances of Mitochondrial DNA Introgression in Orangethroat Darters (Percidae: Etheostoma) Syst Biol, May 22, 2009; (2009) syp014v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
O. Akerborg, B. Sennblad, L. Arvestad, and J. Lagergren Simultaneous Bayesian gene tree reconstruction and reconciliation analysis PNAS, April 7, 2009; 106(14): 5714 - 5719. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. S. Kubatko, B. C. Carstens, and L. L. Knowles STEM: species tree estimation using maximum likelihood for gene trees under coalescence Bioinformatics, April 1, 2009; 25(7): 971 - 973. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Mathews Phylogenetic relationships among seed plants: Persistent questions and the limits of molecular data Am. J. Botany, January 1, 2009; 96(1): 228 - 236. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Galtier and V. Daubin Dealing with incongruence in phylogenomic analyses Phil Trans R Soc B, December 27, 2008; 363(1512): 4023 - 4029. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. R. Linnen and B. D. Farrell Comparison of Methods for Species-Tree Inference in the Sawfly Genus Neodiprion (Hymenoptera: Diprionidae) Syst Biol, December 1, 2008; 57(6): 876 - 890. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. T. Brumfield, L. Liu, D. E. Lum, and S. V. Edwards Comparison of Species Tree Methods for Reconstructing the Phylogeny of Bearded Manakins (Aves: Pipridae, Manacus) from Multilocus Sequence Data Syst Biol, October 1, 2008; 57(5): 719 - 731. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. R. Holland, A. C. Clarke, and H. M. Meudt Optimizing Automated AFLP Scoring Parameters to Improve Phylogenetic Resolution Syst Biol, June 1, 2008; 57(3): 347 - 366. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. C. Avise and T. J. Robinson Hemiplasy: A New Term in the Lexicon of Phylogenetics Syst Biol, June 1, 2008; 57(3): 503 - 507. [Full Text] [PDF] |
||||
![]() |
M. Steel and A. Rodrigo Maximum Likelihood Supertrees Syst Biol, April 1, 2008; 57(2): 243 - 250. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. M. Belfiore, L. Liu, and C. Moritz Multilocus Phylogenetics of a Rapid Radiation in the Genus Thomomys (Rodentia: Geomyidae) Syst Biol, April 1, 2008; 57(2): 294 - 310. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

















