© 2008 Society of Systematic Biologists
On the Distributions of Bootstrap Support and Posterior Distributions for a Star Tree
Edited by Paul Lewis
Department of Mathematics and Statistics, Dalhousie University Halifax, Nova Scotia, Canada B3H 3J5; E-mail: susko{at}mathstat.dal.ca
| Abstract |
|---|
|
|
|---|
Several authors have recently noted that when data are generated from a star topology, posterior probabilities can often be very large, even with arbitrarily large sequence lengths. This is counter to intuition, which suggests convergence to the limit of equal probability for each topology. Here the limiting distributions of bootstrap support and posterior probabilities are obtained for a four-taxon star tree. Theoretical results are given, providing confirmation that this counterintuitive phenomenon holds for both posterior probabilities and bootstrap support. For large samples the limiting results for posterior probabilities are the same regardless of the prior. With equal-length terminal edges, the limiting distribution is similar but not the same across different choices for the lengths of the edges. In contrast to previous results, the case of unequal lengths of terminal edges is considered. With two long edges, the posterior probability of the tree with long edges together tends to be much larger. Using the neighbor-joining algorithm, with equal edge lengths, the distribution of bootstrap support tends to be qualitatively comparable to posterior probabilities. As with posterior probabilities, when two of the edges are long, bootstrap support for the tree with long branches together tends to be large. The bias is less pronounced, however, as the distribution of bootstrap support gets close to uniform for this tree, whereas posterior probabilities are much more likely to be large. Our findings for maximum likelihood estimation are based entirely on simulation and in contrast suggest that bootstrap support tends to be fairly constant across edge-length choices.
Keywords: Bootstrap support; posterior probability; star tree
Received January 7, 2008; Revised February 26, 2008; Accepted April 28, 2008
When data are generated from a four-taxon star topology, most people's intuition would lead them to expect statistical measures of uncertainty to show little or no support for any one of the resolved topologies (Fig. 1). Suzuki et al. (2002) seem to have been the first to have noticed that posterior probabilities for resolved topologies could occasionally be quite large. More surprisingly, they found that this phenomenon, which has come to be referred to as the star tree paradox, persists with large sequence lengths.
|
A number of authors took note of and followed up on the observations in Suzuki et al. (2002). Cummings et al. (2003) considered the relationship between bootstrap support and posterior probabilities with a four-taxon star topology, finding that values were, by and large, comparable and that both showed considerable variation. Because they used sequence lengths of 1000, the possibility was left open that bootstrap support and posterior probabilities would eventually converge upon some value. Lewis et al. (2005) conducted the first focused study on the issue, finding that variation in posterior probabilities persisted even with very large sequence lengths of 100,000. They noted that the paradox arose more generally in simpler problems such as coin-tossing experiments. As a remedy, they suggested placing unresolved and resolved trees on an equal footing by placing positive prior probability on unresolved trees. The work of Yang and Rannala (2006) expanded on some of the themes in Lewis et al. (2005). They gave further examples of simple problems where analogous issues arose and pointed out that such problems had a long history (Lindley, 1957; Shafer, 1982). They also extended the scope of remedies to include priors that place increasing density near unresolved trees.
Kolaczowski and Thornton (2006) argued that there is no star tree paradox. To examine limiting behavior, they conducted repeated MCMC analyses using expected pattern frequencies for star tree models as data and found posteriors close to 1/3 with little variation. They also found, however, that variance in posterior probabilities was not diminished even with sequence lengths as large as 107 and posteriors larger than 0.95 were occasionally found with large sequence lengths. Steel and Matsen (2007) proved that no matter how large the threshold, there was positive probability that any of the posteriors would be greater than it. Yang (2007) recently reconsidered the paradox as well. Using simulations and Laplace approximations similar to the ones here, the notion of a fixed limit for posterior probabilities was dismissed and, as in the earlier works of Lewis et al. (2005) and Yang and Rannala (2006), posterior probabilities with increasing mass near the star tree were put forward as a remedy.
These results are elaborated on here. Some of the reasons for continuing interest in the issue is to gain a more complete understanding of this surprising result and to see how far the results can be extended; extension here is primarily to differing edge lengths with additional results for equal edge lengths. Because star trees are the most extreme case of a poorly resolved tree, the approximations also provide a better understanding of the large sequence length properties of commonly used statistical measures of topological uncertainty when topologies are not well resolved. The interest in the paradox is, however, unusual in being a frequentist study of Bayesian probabilities; the generating star tree is fixed in analyses. Huelsenbeck and Rannala (2004) and Yang and Rannala (2006) make the important point, relevant to the study here, that Bayesian posteriors are not designed to have frequentist properties. Posterior probabilies are not supposed to yield 5% "type I errors," when data are generated from fixed parameters; indeed the notion of a type I error is not clearly relevant in a Bayesian context. Although it may thus be no surprise when Bayesian probabilities do not have certain properties in a frequentist simulation setting, this does not imply that what properties they do have in such settings are not of interest. They are of interest to researchers who are increasingly presented with, and need to make comparisons between, results presented according to both frequentist and Bayesian paradigms.
Characterizations of the distribution of posterior probabilities in a four-taxon case will be obtained here. The approach is similar to Yang (2007) in that Laplace approximations to posterior numerator terms are used. The limiting distribution turns out to be expressible as a transformation of dependent uniform and chi-squared random variables or, alternatively, as transformations of normal random variables. One of the merits of the theoretical approach is that it indicates that the results are not dependent on the prior.
Consistent with previous findings, even as sequence lengths get arbitrarily large, posterior probabilities do not converge to a fixed value—in particular, not to the anticipated fixed value of 1/3—but rather have a limiting distribution with positive variance. In contrast to most previous studies, consequences of unequal edge lengths in the tree are considered. When two of the edges for the star tree are long and the others are short, it turns out that the posterior probabilities for the topology with the long branches together tend to be much larger.
The main other measure of statistical uncertainty used in practice is bootstrap support. Although the rationales for the two measures are quite different, because they both arise in practice, there is considerable interest in whether they are comparable and in when and why they tend to be different. Our theoretical results for bootstrap support are limited to the more tractable case of the neighbor-joining algorithm (Saitou and Nei, 1987). The distribution of bootstrap support tends to be comparable to posterior probabilities for equal edge lengths. As the lengths of the long edges increase, much like posterior probabilities, bootstrap support for the long-branches-together tree tends to become larger, eventually getting close to a uniform distribution.
Simulation was required to obtain the results for likelihood methods. Surprisingly, the distribution is comparable to that of posterior probabilities with equal edge lengths but it remains relatively stable as the long edges increase.
| Methods |
|---|
|
|
|---|
The Limiting Distribution of Posterior Probability
Similarly as in Yang, Laplace approximation (cf. Tierney and Kadane, 1986) is used to obtain the limiting distribution of the posterior distributions. In contrast to Yang, where approximations were based on expansion about the maximum likelihood estimator, the approximations used here are about the true star tree parameters. This makes approximation easier. For instance, one of the issues that Yang needed to deal with was the possible nonsingularity of the information matrix on random generations. Only one information matrix, the one corresponding to expansion about the true parameter values, is used here so that singularity either holds or it doesn't, once and for all. The derivation of the limiting distribution is given in Appendix A.
Assumptions used in obtaining the result are as follows. It is assumed that the terminal edges in the generating tree are all positive. It is also assumed that substitutions along an edge are according to a Markov model with nonzero frequencies of character states and nonzero rates of exchange. Together these assumptions imply that probabilities of any pattern of character states at a site are nonzero. It is also assumed that, for a given topology, edge lengths are identifiable: no two sets of edge lengths give exactly the same pattern probabilities. All of the examples use the Jukes-Cantor model (Jukes and Cantor, 1969), which, with positive terminal edges, satisfies the above assumptions. At least in a neighborhood of the true edges, the prior,
(t), for the edge lengths is assumed to be continuous, positive, and have a bounded derivative. This assumption is satisfied for common priors like the uniform or exponential.
To establish some notation, label the four taxon topologies j = 1, 2, and 3 according to the neighbor, j+1, of 1 in Figure 1. Let t* = [t1*,...,t4*,0] denote the generating edge lengths for the star topology. Let
n Sjn be the gradient of the log-likelihood for the jth topology, lj(t), evaluated at t*, having rth element
/
trlj(t*), and let Ij denote the covariance matrix of Sjn;
n Sjn is sometimes referred to as the score function and nIj are the expected information matrix, where n is the number of sites. As mentioned above, Ij is assumed positive definite.
The limiting distribution of posterior probability for the jth topology is the same as that of
|
| (1) |
To briefly outline the reasons for the limiting distribution, the posterior for topology j can be expressed as
j/
k
k, where
|
| (2) |
|
| (3) |
(z) = P(Z
z) for a standard normal random variable Z. The second factor gives the Uj and Xj2 = SjnTIj–1Sjn, the limiting distributions of which follow from likelihood theory. As is shown in Appendix A, the posteriors can alternatively be characterized as having a limiting distribution that is the same as a known transformation of a normal random vector. In the results reported below, repeated simulation of normal random vectors and then transformation as described in Appendix A was used to obtain the distributions. Note that simulation here is much simpler than the usual approach. Usual approaches require repeat generations of large sequence alignments and then repeated generations within a Markov chain Monte Carlo (MCMC) algorithm, for each of these sequence alignments, to get posterior probabilities.
The Limiting Distribution of Bootstrap Support
In obtaining distributions for bootstrap support, attention will be restricted to the neighbor-joining algorithm and the Jukes-Cantor model (Jukes and Cantor, 1969); as before, nonzero terminal edge lengths are assumed. Although simulation results for maximum likelihood estimation will indicate that the distributions do not hold generally across different methods of estimation, I expect that they can be extended to other substitution models. Focus here is upon bootstrap support for topology 1, having taxa 1 and 2 as neighbors; the results imply the distributions for the other topologies, through a permutation of the taxon labels.
For the neighbor-joining algorithm, topology 1 is estimated if
|
| (4) |
) denote the probability that two standard normal random variates with correlation
are greater than v1 and v2, respectively. Let
* denote the limiting correlation between Z1 and Z2 in (4); an expression for
* is given in Appendix B. The main result is that the distribution of bootstrap support is the same as the distribution of
|
| (5) |
*. The derivation is given in Appendix B.
Special cases where explicit forms for the distribution can be obtained illustrate how the distribution changes as a function of
*. In the case that
* = 1, G gives the probability that two perfectly correlated standard normal random variables are greater than V1. Because they are perfectly correlated, this is the same as the probability that either one of them is greater than V1. In short, the limiting distribution of bootstrap support is the same as 1 –
(V1), where
is the standard normal cumulative distribution function. It can be shown that this random variable has a uniform distribution.
When the correlation is 0, the random variables are independent and it turns out that
|
|
What the special cases suggest is that as correlation between Z1 and Z2 increases, the probability of large bootstrap support gets larger and attains a maximum when the correlation is 1. For the Jukes-Cantor model, this special case turns out to be of particular interest.
| Results |
|---|
|
|
|---|
With equal edge lengths, distributions of posteriors are the same across different topologies and comparable but different across choices of edge lengths. Moreover, there is a nonnegligible probability of posterior probabilities being large, regardless of the sequence length. This is illustrated in Figure 2, which gives the limiting survivor functions for different choices of edge lengths. The survivor function gives the probabilities of having posterior probabilities larger than the corresponding value on the x-axis; it is equal to 1 minus the cumulative distribution function. Although unconventional, this interpretation is of particular interest in the present setting due to the surprisingly nonnegligible limiting frequencies of large posteriors. Because it is known that the distributions are the same for each topology, the survivor functions are plotted for the first topology alone.
|
The survivor functions were obtained through repeated simulation of normal vectors, W, with mean 0 and variance-covariance matrix
having entries given by (A1) to (A3) in Appendix A. Transformation (A4) then gives the posteriors. A total of 100,000 simulations were conducted to obtain each curve. Because simulation was used to obtain the survivor functions, pointwise 95% bounds have been included in the case that the equal edge lengths are 0.01 to give an indication of possible effects of simulation; they are visible only as a thicker line. Although the survivor functions are comparable for different choices of equal edge lengths, they differ as well. To check the large sample approximations against the results that one would obtain using MCMC techniques and fixed sequence lengths, MrBayes (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003) was used. A total of 1000 simulations were conducted at each parameter setting with sequence length 25,000 from the Jukes-Cantor substitution model. That the Jukes-Cantor model is the correct substitution model was treated as known in the prior (nst=1, statfreqpr=fixed(equal)). A total of 10,000 MCMC generations were run with a sampling frequency of 10 and 25 burn-in samples (ngen=10000, samplefreq=10, sumt burnin=25). All other parameters were set to default values. The results are given in Figure 3 and indicate a very close fit between theoretical and actual distribution in the case that all edge lengths are equal.
Consider now the case where two of the terminal edges are long and two are short. Specifically, assume that the edges leading to 1 and 3 are long and that those leading to 2 and 4 are short. As discussed earlier, because the Xj2 and Uj in (1) have the same distribution regardless of the topology j, the main reason that a bias may enter into the posteriors is that the |Ij|–1/2 may be different. From (1) a rough measure of the posteriors that can be expected is what I refer to as the information ratio:
|
| (6) |
For any fixed choice of long branches that I have considered, as the ratio of short to long edge lengths increases from 0 to 1, the information ratio decreases from 1 to 1/3. This indicates that there will be a strong bias towards high posterior probabilities for the long-branch-attract topology 2 in the case that the ratio of short to long edge lengths is small. The bias towards high posteriors for the long-branch-attract topology 2 suggested by the information ratios is confirmed in Figure 3.
|
The distribution of bootstrap support for topology 1 using the neighbor-joining algorithm is determined by the correlation of Z1 and Z2 in (4). When the correlation is 1, the distribution is uniform and as the correlation decreases, the distribution becomes more skewed towards smaller values. I have calculated the correlations over a large range of long (l) and short (s) edge lengths. When l = s the correlation is 0.5, but, as l gets relatively large, the correlation approaches 1 for the tree with two long branches together and 0 for the tree with long branches apart. As discussed in the previous subsection, the distribution of bootstrap support is uniform when the correlation is 1 and the probability of bootstrap support less than v is v–vlog (v) when the correlation is 0. Figure 4 gives the limiting distributions of bootstrap support and the estimated survivor functions when sequence length is 10,000 based on 1000 simulations from the Jukes-Cantor model for the same edge-length settings and topologies as were considered in Figure 3. There is a very close fit between the theoretical and finite sequence-length distributions regardless of edge-length setting. As expected based on the calculated limiting correlations, distributions are very close to uniform (linear survivor functions) in cases when two of the edges are long and two are short.
|
Theoretical limiting distributions were difficult to obtain for bootstrap support using maximum likelihood estimation and thus relied entirely on simulation. The DNAML routine in the PHYLIP package (Felsenstein, 1989, 2004) was used to obtain maximum likelihood estimates. A total of 1000 simulations were conducted for each edge-length setting, each with sequence length 10,000; similar results were obtained with sequence lengths of 25,000. In contrast to bootstrap support for distance methods, the survivor functions are similar across all edge-length settings (Fig. 5) and, in fact, are similar to the limiting distributions of posteriors and bootstrap support for the neighbor-joining algorithm when the generating tree has all equal edge lengths (Fig. 6).
|
|
| Discussion |
|---|
|
|
|---|
The results indicating that bootstrap support and posteriors tend to be comparable when edge lengths are equal, but differ substantially when long edges are present, are similar to the findings in Cummings et al. (2003), who found the biggest differences in what they referred to as the two branch corner. The Cummings et al. (2003) star tree simulations found bootstrap support and posterior probabilities to be comparable. Because the simulation had equal expected (simulations did not use constant edges) external edge lengths, the findings are consistent with the ones here.
In the case of equal edge lengths, the distribution of posterior probabilities differed across choices of edge lengths. This finding differs somewhat from Yang (2007). However, the direct results there were for a three-taxon rooted tree. The analogue of
n Sjn in the three-taxon problem is the gradient of the likelihood, now lj(t1,t0); edge lengths t0 and t1 are as in figure 2 of Yang (2007). The information matrix, Ij, can still be taken as the covariance matrix of Sjn. With these changes in interpretation, the argument in Appendix A for the approximations to the posterior can be repeated to give (3). Substantial simplification of the score and information is possible in this three-taxon problem and can be used to show that the limiting distribution of the posterior for topology j is the same as that of
|
| (7) |
Most common edge-length priors used in practice, including the exponential and uniform priors, satisfy the conditions required for the limiting distributions. It is interesting to note that the limiting distribution of posterior probability does not depend on the prior, except in as much as it must satisfy the initial assumptions. The star-tree paradox is a general one that is not due to, nor influenced by, the choice of priors.
Lewis et al. (2005) pointed out that placing positive prior probability at the star topology will cause the prior for the star topology to converge to 1. Yang and Rannala (2006) and Yang (2007) extended this result to priors that place prior density near the star topology that increases quickly enough with n. If density near the star topology increases without bound, but slowly enough, posteriors for each of the topologies are expected to converge to 1/3. Not surprisingly, because different distributions result, priors such as these do not satisfy the assumptions here. Although such solutions provide insight into how comparison of models of differing dimension can create unusual behavior, it is exactly the type of prior assignment that has been one of the primary criticisms of Bayesian approaches: choosing priors biased towards a particular point of interest.
Another assumption that was made was that edge lengths are identifiable parameters: no two sets of edge lengths give exactly the same pattern probabilities. If this is not the case, integration over regions of t that are far from the generating t* cannot be ignored with large sequence lengths, as they are in Appendix A. It is likely that the distributions and properties of posterior support will be quite different. For models that do not involve rates across sites, as is the case here, identifiability assumptions have been shown to hold in Chang (1996). For models that do involve rates across sites, however, Steel et al. (1994) illustrate that identifiability does not generally hold. Recent work (Allman and Rhodes, 2008; Allman et al., 2008) indicates, however, that for common rates across sites models, like the gamma model of Yang (1994), whereas there may be choices of generating t* that are not identifiable, there will usually be many for which identifiability holds.
Although the results are restricted to four taxa, they are suggestive of behavior with larger trees. For more general trees, the results may be encouraging in that they suggest edge-length priors are not likely to have a large influence on a Bayesian analysis when sequence length is large. On the other hand, the high probability of large posteriors when both long and short edge lengths are present in the generating tree suggests that large posteriors for splits should be of concern when long and short edge lengths are present in a tree. The results also provide some insight into some of the reasons that bootstrap support does not always correlate well with posterior probabilities. For a poorly resolved split, with equal edges, similar distributions are expected but, in cases of long and short branches, posterior probabilities for the long-branch-together split tend to be larger.
| Appendix A. The Limiting Distribution of Posterior Probability |
|---|
|
|
|---|
A more complete statement of the limiting distribution result for posterior probabilities follows; it is the one that is used in the repeated simulations of normals.
Let pk(t;j) denote the pattern probabilities for the pattern k, for the jth topology and edge lengths t = [t1,...,t4,t5]; k is the pattern for an alignment column, for instance, k = xxyy. Here t5 denotes the internal edge length and tl the terminal edge length leading to l. Let t* = [t1*,...,t4*,0] denote the generating edge lengths for the star topology. Let pk = pk(t*) denote the pattern probabilities for the star topology. Similarly, let
k denote the proportion of times pattern k arose in the original alignment.
Let W = [W1,...,W7]T be N(0,
) (normal distributed with mean 0 and covariance matrix
) where
is given by the following relationships for i, j
4, l, l'
3
|
| (A1) |
|
| (A2) |
|
| (A3) |
Let Vj = [W1,...,W4,W4+j]T and let Ij denote the variance-covariance matrix of Vj. The limiting distribution of the posterior for the jth topology is then the same as that of
|
| (A4) |
(z) = P(Z
z) for Z
N(0,1).
The posterior for topology j can be alternatively expressed as the
j/
k
k, where
|
| (A5) |
1,
2,
3]. Unless otherwise stated, convergence means convergence of distributions. The reader interested in a overview of the argument might read the first few sentences of the sections starting in bold below.
The integral over ||t – t*|| >
. Because the log likelihood will only be large for t near the generating t* for large n, it can be shown that for any
> 0,
|
| (A6) |
n
0.
This follows from the consistency of ML estimation for the problem (Wald, 1949; Kiefer and Wolfowitz, 1956). These results require that we be able to extend the definition of pk(t;j) to values of t where one or more of the edge lengths are infinite, and that limits of pk(t;j) coincide with this definition as one or more of the edge-lengths diverge in a corresponding manner. We can do this by independently calculating probabilities of patterns for (groups of) taxa separated by an infinite edge length. Because for the original edge lengths, t*, probabilities of patterns will not be independent for any two sets of taxa, the extended definition will give rise to pattern probabilities that differ from the true pattern probabilities. Because it has been assumed that pattern probabilities from any t
t* will differ from the true pattern probabilities, identifiability of edge lengths holds more generally on this extended parameter space that includes infinite edge lengths.
Because all expectations here involve finite sums, because derivatives of any order of pk(t;j) with respect to edge lengths can be taken, and because all of the pk > 0, it is not difficult to show that all other regularity conditions required for consistency are satisfied. Equation (2.12) of Kiefer and Wolfowitz (1956) then applies giving that, with probability 1, for
> 0, there will be an 0 < h < 1 with
|
|
|
|
|
| (A7) |
nSjn denote the gradient of lj(t) at t*. A Taylor's series expansion gives
|
| (A8) |
|
| (A9) |
.
A bound on the remainder term. There is a constant K such that, with probability 1, for large n, for all ||t – t*|| <
,
|
| (A10) |
sufficiently small and ||t–t*|| <
, t will have positive terminal edge lengths as well. For Markov substitution models with nonzero frequencies and rates of exchange, this is sufficient for the pattern probabilities, pk(t;j) to be positive. We can thus conclude that there is a K such that, for all patterns k and all ||t – t*|| <
, |
|
|
|
Large-sample likelihood theory. Two results used in what follows come from the large-sample likelihood theory (c.f. Lehman, 1983; Pawitan, 2001, chapter 9). These results are that Sjn converges in distribution to a Vj that has a N(0,Ij) distribution and that Jjn converges to Ij. The matrix Ij is assumed positive definite (whether it is or not can be figured out in any particular case and was always true in our examples). The large-sample theory requires that regularity conditions be satisfied. Because all expectations here involve finite sums, because derivatives of any order of pk(t;j) with respect to edge lengths can be taken and because all of the pk > 0, it is not difficult to show that these regularity conditions are satisfied.
The integral over n–2/5 < ||t – t*|| <
. The large-sample likelihood theory results and the bound (A10) can be used to show that exp [lj(t) – lj(t*)] is small enough for n–2/5 < ||t – t*|| <
that (A6) can be refined to
|
| (A11) |
0.
Substituting the bound (A10) in (A8) gives that for n–2/5 < ||t – t*||
|
| (A12) |
> 0 sufficiently small and all n–2/5 < ||t–t*|| <
, there is a
> 0 such that {lj(t) – lj(t*)}/(n||t–t*||2)
–
. Thus |
|
The integral over ||t – t*||
n–2/5. For ||t – t*||
n–2/5, the bound (A10) on the remainder term for (A8) gives that ||Rjn||
Kn–1/5. It follows that
|
|
Cn
exp [n–1/5K] converges to 1. A change of variables to u =
n(t–t*) gives
|
| (A13) |
|
|
For ||u||
ny,
|
|
ny |
|
|
|
|
| (A14) |
For ||u||
n1/4, the bounded derivative of the prior at t* implies that for some K*, |
(t*+u/
n) –
(t*)|
K* n–1/4 This gives that
|
| (A15) |
|
| (A16) |
A normal integral expression. Because the contribution to the integral converges to 0, the region u > n1/10 can be added to the region of integration in (A14). For the terminal edges, Because t*j > 0, –
n t*j will be less than –n1/4 for n large. Thus the regions uj < –
nt*j can be added to the region of integration in (A14). For the middle edge, however, t*5 = 0 and so the constraint u > –
n t* in (A14) gives u5 > 0 for this component. With these changes to the region of integration and with an alternative expression of the integrand as in (A16),
|
| (A17) |
N(Jjn–1Sjn,Jjn–1). If Y is normal then so is Y5 and its mean and variance are determined from the mean and variance-covariance matrix of Y. Specifically, the mean is [Jjn–1Sjn]5 and [Jjn–1]55. Standard normal calculations give that the probability that P(X > 0) =
(µ/
) when X
N(µ,
) Replacing Jjn with its limiting value Ij, the final approximation is
|
| (A18) |
|
|
The covariances for the Wn entries are not difficult to obtain and are given in (A1) to (A3). The central limit theorem then gives that Wn converges in distribution to the W described before (A4).
The limiting distribution. Since Jjn
Ij and the Sjn are obtained from the Wn in the same manner that the Vj are obtained from W, the numerator (A18) converges in distribution to the numerator in (A4) multiplied by
(t*). It then follows that the limiting distribution is as given in (A4).
| Appendix B. The Limiting Distribution of Bootstrap Support |
|---|
|
|
|---|
There are two distinct probability distributions in bootstrapping. One is the probability distribution associated with the generation of data: characterized by the Jukes-Cantor Markov substitution model on the tree. The other is the bootstrap probability distribution. It is dependent or conditional upon the data and is the probability distribution bootstrap samples are drawn from. The limiting distribution is obtained in two steps by first obtaining an approximation to the bootstrap probability of topology 1 for fixed data and then allowing the data set to vary and considering what this approximation converges to.
First, consider bootstrap probabilities for fixed data. The vector giving the proportions of differences in a bootstrap sample, p(b), is obtained by independently sampling n sites from the original alignment and determining the proportions of differences for the pairs. It can be expressed as
|
| (1) |
(i) = [
12(i),...,
34(i)]T and has
kl(i) = 1 or 0 according to whether the pair of taxa k and l have a difference or not at the ith bootstrap sampled site. The reason for expressing p(b) as a mean is that it makes clear that the central limit theorem applies and the bootstrap distribution of p(b) is approximately normal. This is the distribution for fixed data and so the means and variances for it have to be calculated from the bootstrap distribution that yields site pattern according to their frequencies in the fixed alignment. Calculations give the mean of pij(b) as
p(|
| (2) |
|
|
n(p(b)– |
| (3) |
|
|
|
|
(|
| (4) |
It follows that, for fixed data and large n, the bootstrap probability that topology 1 is estimated is well approximated by P(Z1 > 0, Z2 > 0) where [Z1,Z2]T has a multivariate distribution with mean z(
) and variance-covariance matrix
(
,
). Expressing this probability in terms of standardized normal random variables gives that, for a fixed data set and large n, the bootstrap probability that topology 1 is estimated is well approximated by
|
| (5) |
(
(
(
(
The expression (B5) gives an approximation to the bootstrap probability for fixed
. Now allow the data to vary. The situation is very similar to the one that led to (B5) but instead of sampling from the bootstrap distribution that yields site pattern according to their frequencies, sequences are generated from the pattern probabilities implied by the Jukes-Cantor model and the tree. The central limit theorem still applies to
but now its mean is calculated as p, the vector of probabilities that taxa i and j have differing nucleotides, and its variance-covariance matrix is
p(p,w), where
and
have been replaced by their corresponding probabilities p and w. The approximation (B3) still applies, but with p(b) replaced by
and
replaced by p: z(
)
z(p) + z'(p) (
– p). The generating model is the star tree, however, and for this model, z(p) = 0; i.e., the linear transformation of distances in (4) are exactly equal to 0. Thus z(
)
z'(p) (
– p) and similarly as in the argument that led to (B5) we obtain that z(
) is approximately normal but with mean 0 and variance-covariance matrix
(p, w). It follows that –z1(
)/
(
,
)11 and –z2(
)/
(
,
)22 converge in distribution to V1 and V2, which are normal with mean 0 and unit variance as well as correlation
|
|
Because
(
,
) converges to
*, the approximation (B5), to the bootstrap probability that topology 1 is estimated, converges to
|
|
*. | Acknowledgment |
|---|
|
|
|---|
I would like to thank Andrew Roger, Mike Steel, Ziheng Yang, and an anonymous referee for valuable comments. This research was supported by a Discovery grant from the Natural Sciences and Engineering Research Council of Canada.
| References |
|---|
|
|
|---|
-
Allman E. S., Rhodes J. A. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci. (2008) 211:18–33.[CrossRef][Web of Science][Medline]
Allman E. S., Ané C., Rhodes J. A. Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. Appl. Prob. (2008) 40:229–249.[CrossRef]
Chang J. T. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Math. Biosci. (1996) 137:51–37.[CrossRef][Web of Science][Medline]
Cummings M. P. S. A., Handley D. S., Myers D. L., Reed Rokas A., Winka K. Comparing bootstrap and posterior probability values in the four-taxon case. Syst. Biol. (2003) 52:477–487.
Felsenstein J. PHYLIP: Phylogeny inference package Version 3.2. Cladistics (1989) 5:164–166.
Felsenstein J. PHYLIP: Phylogeny inference package Version 3.6 (2004) Seattle: Department of Genome Sciences, University of Washington. Distributed by the editor.
Huelsenbeck J. P., Rannala B. Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst. Biol. (2004) 53:904–913.
Huelsenbeck J. P., Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics. (2001) 17:754–755.
Jukes T. H., Cantor C. R. Mammalian protein metabolism (1969) New York: Academic Press. 21–123.
Kiefer J., Wolfowitz J. Consistency of the maximum likelihood estimates in the presence of infinitely many nuisance parameters. Ann. of Math. Stat. (1956) 27:887–906.[CrossRef]
Kolaczowski B., Thornton J. W. Is there a star tree paradox? Mol. Biol. Evol. (2006) 23:1819–1823.
Lehman E. L. Theory of point estimation (1983) New York: Wiley.
Lewis P. O., Holder M. T., Holsinger K. E. Polytomies and Bayesian phylogenetic inference. Syst. Biol. (2005) 54:241–253.
Lindley D. V. A statistical paradox. Biometrika (1957) 44:187–192.
Pawitan Y. In all likelihood: Statistical modelling and inference using likelihood (2001) Oxford, UK: Oxford University Press.
Ronquist F., Hulesenbeck J. P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. (2003) 19:1572–1574.
Saitou N., Nei M. The neighbor-joining method: A new method for reconstructing evolutionary trees. Mol. Biol. Evol. (1987) 4:406–425.[Abstract]
Shafer G. Lindleys paradox. J. Am. Stat. Assoc. (1982) 77:325–334.[CrossRef][Web of Science]
Steel M., Matsen F. A. The Bayesian star paradox persists for long finite sequences. Mol. Biol. Evol. (2007) 24:1075–1079.
Steel M., Székely L. A., Hendy M. D. Reconstructing trees when sequence sites evolve at variable rates. J. Comp. Biol. (1994) 1:153–163.
Suzuki Y., Glazko G. V., Nei M. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc. Natl. Acad. Sci. USA (2002) 99:16138–16143.
Tierney L, Kadane J. B. Accurate approximations for posterior moments and marginal densities. J. Am. Stat. Assoc. (1986) 81:82–86.[CrossRef][Web of Science]
Wald A. Note on the consistency of the maximum likelihood estimate. Ann. of Math. Stat. (1949) 20:595–601.[CrossRef]
Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. (1994) 39:306–314.[CrossRef][Web of Science][Medline]
Yang Z. Fair-Balance paradox, star-tree paradox, and Bayesian phylogenetics. Mol. Biol. Evol. (2007) 24:1639–1655.
Yang Z., Rannala B. Branch-Length prior influences Bayesian posterior probability of phylogeny. Syst. Biol. (2006) 54:455–470.[CrossRef][Web of Science]
This article has been cited by other articles:
![]() |
E. Susko Bootstrap Support Is Not First-Order Correct Syst Biol, July 1, 2009; (2009) syp016v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Yang Empirical evaluation of a prior for Bayesian phylogenetic inference Phil Trans R Soc B, December 27, 2008; 363(1512): 4031 - 4039. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||

























