© 2007 Society of Systematic Biologists
Resource-Aware Taxon Selection for Maximizing Phylogenetic Diversity
Edited by Mike Steel: Associate Editor
EMBL—European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK E-mail: pardi{at}ebi.ac.uk (F.P.)
| Abstract |
|---|
|
|
|---|
Phylogenetic diversity (PD) is a useful metric for selecting taxa in a range of biological applications, for example, bioconservation and genomics, where the selection is usually constrained by the limited availability of resources. We formalize taxon selection as a conceptually simple optimization problem, aiming to maximize PD subject to resource constraints. This allows us to take into account the different amounts of resources required by the different taxa. Although this is a computationally difficult problem, we present a dynamic programming algorithm that solves it in pseudo-polynomial time. Our algorithm can also solve many instances of the Noah's Ark Problem, a more realistic formulation of taxon selection for biodiversity conservation that allows for taxon-specific extinction risks. These instances extend the set of problems for which solutions are available beyond previously known greedy-tractable cases. Finally, we discuss the relevance of our results to real-life scenarios.
Keywords: Biodiversity conservation; comparative genomics; dynamic programming; phylogenetic diversity; Noah's Ark Problem; species choice; taxon selection
Received September 21, 2006; Revised November 16, 2006; Accepted December 29, 2006
The phylogenetic diversity (PD) of a set of taxa (Faith, 1992), (roughly) the same amount of resources. In reality,loosely defined as the total length of the evolutionary tree connecting them, is a measure of increasing importance in a number of areas in biology, in particular biodiversity conservation (Crozier, 1997; Barker, 2002; Mace et al., 2003; Forest et al., 2007) and comparative genomics (see Pardi and Goldman, 2005, and references therein). In bioconservation, the protection from extinction of species or populations that have a large total PD or, equivalently, that represent a large portion of evolutionary history: Nee and May, 1997) is regarded as a good way to preserve genetic diversity (Crozier, 1997) and, more generally, diversity in the biological features of the existing organisms (Faith, 1992). In genomics, on the other hand, comparing homologous sequences with a high total divergence (which is to sequences what PD is to taxa) allows testing of evolutionary hypotheses and detection of various genomic features with high statistical power (Thomas et al., 2003; Eddy, 2005; Pardi and Goldman, 2005). Comparative genomics, like biological conservation, is therefore better done on phylogenetically diverse sets of genes (or individuals, populations or species: the unit is not important and we will use the word taxa throughout). Genome sequencing projects are increasingly aware of this (Margulies et al., 2005).
Often it will not be possible to save every existing species from large extinction events (such as the one we are currently causing) or to sequence every available genome. It is therefore natural to ask how choices should be made in these areas—"the agony of choice" is a popular phrase in bioconservation (Vane-Wright et al., 1991; Crozier, 1992). TX their (possibly rooted). Because PD has come to be regarded as a good metric for measuring taxon importance, there has recently been interest in the formal optimization problems associated with it: how should we select taxa in order to maximize the resulting PD? The simplest of these problems, consisting of selecting a given number (decided a priori) of taxa, is easily solved by a greedy strategy (Steel, 2005; Pardi and Goldman, 2005). However, this scenario assumes that it is feasible to determine in advance the number of taxa that will be dealt with by the available resources, which is only true if the taxa require (roughly) the same amount of resources. In reality, this is usually not true: in bioconservation, for example, we may be designing a protected geographical region of fixed area, and different species will typically require different amounts of land; similarly, in the case of sequencing, different genomes have different sizes and therefore will have different costs, not only in terms of money but also of time and instruments required for sequencing. Potentially, a choice will have to be made between selecting few "expensive" taxa or many "cheap" ones.
Assuming that "costs" (whatever the available resource, e.g., money, time, labor, machinery, space, etc.) can be roughly quantified, a possible approach to limit "expenditure" could consist of modifying the evolutionary tree used as a basis for calculating PD by shortening each terminal branch by an amount depending on the cost of its taxon, so that the selection of costly taxa would be discouraged (Steel, 2005; Pardi and Goldman, 2005). This is not very satisfying; for example, there is no guarantee that the selected taxa will have maximum PD among all the other choices of taxa with the same total cost.
Here, we adopt a more direct approach: given taxon-associated costs and an estimate of the available resources, which we naturally call the budget, we aim to select the set of taxa of maximum PD among those sets with total cost at most equal to the budget. This can be re-expressed using some formalisms that will be useful throughout this paper. Let
be the chosen phylogenetic scope (Pardi and Goldman, 2005), i.e., the set of taxa we aim to select from, and
X their (possibly rooted) phylogenetic tree, where all branches have non-negative lengths. Each taxon (e.g., species or sequence) s
has a nonnegative integer cost cs. We aim to
|
| (1) |
Although PD(S) has been simply defined as the total length of the smallest subtree of
X connecting all taxa in S (the "minimum spanning path" of Faith, 1992), there have been different interpretations of this definition in the literature (see, for example, the discussion in Faith and Baker, 2006; Crozier et al., 2006). These probably derive from different understandings of what a subtree is: if
X is a rooted tree, do its subtrees necessarily include its root? For example, does the smallest subtree of the mammals that contains human and chimp necessarily contain also the common ancestor of all mammals? It is clear that different ways of answering this question lead to different values of PD and sometimes to different solutions to the optimization problem (1). It is therefore useful to distinguish between the two possible definitions of PD. We will call the unrooted phylogenetic diversity (u PD) of a set of taxa S
eq
the total length of the smallest unrooted subtree of
X connecting all the taxa in S but not necessarily the root (assuming there is one). Conversely, when
X is rooted, the total length of the smallest subtree of
X having the same root as
X and connecting all the taxa in S will be referred to as the rooted phylogenetic diversity (r PD) of S (see Fig. 1). Clearly, denoting the root of
X by
(and assuming 

), we have that rPD(S) = uPD(S
{
}). In practice, the choice between these two measures will be determined by the intended application: u PD seems more appropriate for comparative genomics (as most sequence comparison techniques are independent of root placement), whereas r PD seems to be preferred in conservation biology (Rodrigues and Gaston, 2002). Furthermore, as we will show in the following, the optimization problems associated with these two metrics can be of rather different difficulty, with u PD generally posing more problems than r PD. Insisting on this distinction is therefore not merely a pedantic exercise.
|
Incidentally, this ambiguity in the notion of PD is not surprising if we consider that Faith's paper introducing PD (Faith, 1992) was itself somewhat ambiguous: whereas one of the formal results (Faith, 1992: 4, last line) only holds for u PD, the only example that allows discrimination between the two interpretations (Faith, 1992: fig. 3a, set R3), is consistent with r PD but not u PD. In later papers (e.g., Moritz and Faith, 1998; Faith and Baker, 2006), Faith and colleagues clarified their preference for r PD, but the alternative definition for PD has become widespread (e.g., Crozier, 1997; Barker, 2002; Steel, 2005; Minh et al., 2006) and other authors even refer to r PD as evolutionary history (Nee and May, 1997).
One way of solving problem (1) would be to calculate the cost and PD for all 2n subsets of the taxa, where n = |
| is the number of taxa being selected among. However, this is not feasible even for moderate values of n, and therefore alternative approaches must be sought. The next two sections give novel algorithms that efficiently solve problem (1) for both definitions of PD. Before then, however, some considerations are needed. First, these algorithms will assume that
X is a bifurcating tree, which clearly is not a limitation, as every multifurcating tree can be resolved into an equivalent bifurcating tree in which the new (internal) branches have length 0. Second, note that the formulation (1) can also be used to express the scenario whereby a number of taxa have already been conserved (or sequenced): the search for a maximally diverse extension (Pardi and Goldman, 2005) of an initial set I
can be implemented by solving problem (1) where all leaves in I have their cost set to 0. Third, note that all costs and the budget are (implicitly in the following) assumed to be integers. As will be shown, the algorithms' running times grow quadratically with the budget B. It is therefore important to express all costs and budget as multiples of large units to limit the value of B and, consequently, running times. This granularity is not unrealistic, since real-life resources are inherently discrete (e.g., currencies) and costs and budgets will often be known not with great precision but only approximately.
Compared to the problems for which an efficient solution is known (Steel, 2005; Pardi and Goldman, 2005), problem (1) is clearly a step towards realism. In the context of biodiversity conservation, however, there is an important factor that is not taken into account: different taxa have different risks of extinction (Witting and Loeschcke, 1993, 1995). For example, it is clear that if a species is not endangered at all, then expending resources to conserve this or a very closely related species is not a good choice. It is intuitive that conservation efforts should concentrate on the most endangered species, although under some particular conditions the contrary seems to be true (Weitzman, 1998). These aspects have been formalized by Weitzman (1992, 1993, 1998) into the suggestively named "Noah's Ark Problem" (NAP) (Weitzman, 1998), which is receiving increasing attention especially in environmental and ecological economics (Simianer et al., 2003; Reist-Marti et al., 2003; van der Heide et al., 2005). Although Weitzman (1998) derived a myopic, or greedy, ranking criterion that assigns each taxon a conservation priority score, an exact algorithmic solution to the NAP remains an open problem.
Hartmann and Steel (2006) have shown that some special cases of the NAP can be solved, again, with a greedy algorithm. The algorithms we present here allow the resolution of further relatively general cases of the NAP, extending the scenarios considered by Hartmann and Steel. Although the main focus of this paper is on results that can be used in more than just biodiversity conservation, we will illustrate the consequences of our work for the NAP in Applications to the Noah's Ark Problem.
Finally, we note that problem (1) (and the NAP we define below, which generalizes it) is NP-hard, as the knapsack problem, another well-known NP-hard problem (Cormen et al., 2001), is simply its special case for star-shaped trees (Hartmann and Steel, 2006). Technically, our algorithms are examples of pseudo-polynomial time algorithms: although they run in time polynomial in B, this number may itself grow exponentially in the size of the input (see Garey and Johnson, 1979, for further discussion).
| A Dynamic Programming Algorithm for the Rooted Case |
|---|
|
|
|---|
When phylogenetic diversity is defined as r PD (as we implicitly assume throughout this section), problem (1) can be solved with a dynamic programming algorithm, which we will describe in this section. In the next section we will show that this algorithm can be extended to solve the unrooted (u PD) version of problem (1).
The key observation that allows the use of a dynamic programming algorithm is that optimal solutions to problem (1) can be simply decomposed into optimal solutions to its subproblems—technically, problem (1) is said to have optimal substructure (Cormen et al., 2001). As a consequence, optimal solutions to problem (1) can be constructed by first tackling and solving its subproblems and then combining the optimal solutions thus found.
In order to see what this means in practice, we need to introduce some concepts that allow us to apply problem (1) to smaller portions of
X. We define a clade as any subtree of
X consisting of a branch a and everything else below a in
X (note that we imagine
X with its root at the top); a clade should be thought of as rooted at the top of its top branch a. In a bifurcating tree, every clade is composed of a root branch and, possibly, by two other clades, which we call its subclades. (For example, in Fig. 2, left, clade
has two subclades,
and
, whereas
is composed of one terminal branch and no subclades.)
|
A way to decompose problem (1) into smaller subproblems is to ask what is the best way to invest a given part of the budget into a given clade (i.e., which of the taxa in this clade should be selected given that part of the budget). It turns out that we can answer this question for all clades
B incrementally, starting from the clades consisting of only a terminal branch and then using the solutions already found to construct the solutions for larger clades.
Formally, the subproblems consist of finding, for every b
{0,1,...,B} and every clade
, a subset S of the taxa in
that maximizes rPD
(S) subject to
s
Scs
b. Here rPD
(S) denotes the rooted PD calculated as if
was the whole tree (e.g., in Fig. 2, rPD
({E}) = 3.0). S will be called an (optimal) solution for
and b. Note that one of these subproblems (the one with
=
X and b = B) simply coincides with problem (1) itself (
X can always be seen as having a root branch, possibly of zero length, and therefore is a clade itself). Optimal solutions to these subproblems (or more precisely some sufficient information about them, as we will see later) can be stored in a table (the solutions table) whose rows correspond to all the different clades and whose columns correspond to all the sub-budgets 0,1,...,B (see Fig. 2). Clearly, position (
,b) will contain (information about) an optimal solution for
and b. We will show that this table can be completed one row at a time, starting from the bottom and going up.
First, the solutions for the clades consisting of only a terminal branch (leading to, say, taxon s) are simply either the empty set
or {s}, depending on whether the taxon is too expensive to be taken with the available sub-budget (i.e., cs > b), or not (cs
b), respectively. Therefore the rows of the solutions table corresponding to terminal branches (in Fig. 2, the 3rd, 4th, 6th, and the last two) can be filled without looking at the content of any other row.
Second, when instead a clade
contains two subclades (
and
), an optimal solution S for
and sub-budget b will simply amount to the union S = S![]()
S
of two optimal solutions for two other subproblems: S
will be optimal for
and some sub-budget i
b, whereas S
will be optimal for
and b–i. (The reason for this is simple: S is naturally partitioned into S
and S
, containing the taxa in
and
, respectively; calling i the total cost of S
, if either S
were not optimal for
and i or S
were not optimal for
and b–i, then we could replace this suboptimal choice of taxa with a better one and therefore improve also S, but this would contradict the fact that S is optimal.) For example, in Figure 2, an optimal solution for
and b = 4 is {C,E}, where {C} is optimal for
and sub-budget 2, and {E} is optimal for
and 2.
Importantly, if we have already calculated and stored optimal solutions for
and
and for all sub-budgets, it becomes possible to find a solution for
and any given b: denoting by S
(i) the solution stored for
and i (and similarly for S
(j)), just compare all the subsets S
(0)
S
(b), S
(1)
S
(b–1), ..., S
(b)
S
(0) and take the one with the largest r PD. This is guaranteed to find an optimal solution for
and b, because the possibility of decomposing an optimal solution S into S![]()
S
(where S
is optimal for
and some i
{0,1,...,b}, and S
is optimal for
and b–i) implies that rPD(S) = rPD(S![]()
S
) = rPD(S
(i)
S
(b–i)) and therefore S
(i)
S
(b–i) is also optimal. Note that if there are multiple optimal solutions for
and i (or for
and b–i), some of which are empty and some of which are not (possible if there are paths of zero length from the root of the clade to some of the taxa), we must ensure that the stored solution S
(i) (or S
(b–i)) is nonempty. Otherwise, we may have that the union of optimal solutions is nonoptimal (specifically, when rPD(S![]()
S
) > 0 and rPD(S
(i)
S
(b–i)) = rPD(

) = 0).
Consequently, we can fill the entire solutions table one row at a time: because the content of the row for a clade
can be completely derived from the content of the rows for its subclades
and
, we just need to make sure that, whenever we fill the row for a clade, the rows for its subclades have already been filled. This can be achieved by dealing with the clades in a bottom-up order: let the rows in the table be ordered according to a top-down traversal of all clades (as illustrated in Fig. 2); then, fill the rows from the last to the first. Once the entire table has been filled, the solution to problem (1) is available from its top right corner.
Although storing entire solutions in the solutions table is feasible, this is certainly not efficient, as solutions can contain many taxa and therefore require a variable amount of memory and time to be handled. As we will show, instead of storing solutions, it is sufficient to retain their r PD and the way the sub-budget should be partitioned between the two subclades (if there are any). More precisely, let S
(b) be the solution found for clade
and sub-budget b; instead of storing S
(b), we will store two quantities: (a) the maximum r PD achievable in
with sub-budget b, which we call 
(b) and is equal to rPD
(S
(b)); and (b) (only when
has two subclades) the sub-budget that solution S
(b) allocates to
's left subclade, which we call 
(b)—obviously this also determines the sub-budget to allocate to the right subclade, b–
(b). Note that this does not mean that the actual expenditures in the subclades need equal 
(b) and b–
(b), but rather that the left part of S
(b) is optimal for 
(b), and its right part for b–
(b). In Figure 2, each cell in the solutions table shows 
(b) at the top and, when appropriate, in the two bottom corners, 
(b) on the left and (for illustrative purposes) b–
(b) on the right, which indicates the way the sub-budget should be partitioned between the two subclades.
Note that it is precisely thanks to storing 
(b) that solutions need not be memorized; this information (once available) allows us to reconstruct the optimal solution found for any given
and b: just visit all the clades below
in a top-down fashion always using the 
values to find out what part of b should be devoted to each clade; in the end, b will have been broken into all the expenditures to allocate to each taxon in
; a taxon s should be included in the solution only if the expenditure for s is greater or equal to its cost cs (see Appendix, Algorithm 3, reconstruct(
,b), for a formalization of this procedure). For example, imagine that we wish to reconstruct the solution for clade
and sub-budget 4 in Figure 2; the two values at the bottom corners of the table entry for
and 4 (shaded gray) indicate that we should allocate 2 to
and 2 to
.
only contains taxon C, whose cost is exactly 2 and therefore should be selected.
has two subclades, one containing D and the other E; the table entry for
and 2 indicates that nothing should be spent in the left subclade (so D should not be selected, as its cost is greater than 0) and that 2 should be assigned to the right subclade (so we do select E, as its cost is exactly 2); therefore, the solution for
and 4 is {C,E}.
The solutions table is filled with 
(b) and 
(b) values in a way analogous to the one we described above in terms of whole solutions. The clades are visited in a bottom-up order. For the clades only consisting of a terminal branch (leading to, say, taxon s), the 
(b) values are set to 0 for entries with b up to (but not including) cs and to the length of the terminal branch for the remaining entries; the 
(b) are left undefined, as they have no meaning for these clades. For example, see the solutions table row for
(shaded red) in Figure 2.
When instead we visit a clade
composed of a branch (a) and two subclades (
and
), we need to consider two cases. First, for all the entries with b smaller than the minimum cost among the taxa in
(which we denote by 
), 
(b) is set to 0, as clearly the sub-budget b is not enough to cover the cost of any of the taxa in
; note also that for these entries 
(b) can be set to any i = 0,1,...,b, as any of these values will lead to reconstructing the empty solution (for example, see the first two entries in the table row for
in Fig. 2). Second, for the remaining entries (those with b

), 
(b) is set to the maximum value among ta+
(0)+
(b), ta+
(1)+
(b–1),..., ta+
(b)+
(0) (where ta represents the length of a), as this is the r PD of the best possible combination of complementary solutions for
and
; the corresponding 
(b) is set to a value of i
{0,1,...,b} that maximizes 
(i)+
(b–i). For example, suppose we aim to fill the entry for
and 4 in Figure 2. Because we are proceeding in a bottom-up fashion, the rows corresponding to
and
have already been filled, and the 
and 
values are all available. Therefore, 
(4) = max {1.0+
(0)+
(4),1.0+ 
(1)+
(3),...,1.0+
(4) +
(0)} = max {5.0,5.0,5.0,2.0,2.0} = 5.0. This corresponds to consideration of combining the first entry shaded in red with the fifth in blue, the second in red with the fourth in blue, and so on; in the end we check which of these combinations has given the largest r PD. The value of 
(4) is set accordingly: in this case, there are three equivalent combinations and 
(4) could be set to 0, 1, or 2 (leading to two different but equally good solutions).
In summary, the
and
values are calculated with the following recursions (which assume that
is composed of branch a and, possibly, subclades
and
):
|
|






b, or i = b–

b). Implicitly storing the empty solution would either be wrong (if ta > 0, any optimal solution is certainly nonempty) or (if ta = 0) might lead to constructing a nonoptimal solution further up in the tree (see discussion above).
The various minimum costs 
for all clades should also be derived and stored. For a given
, this can be done when we set about filling its row, by either simply copying cs (if
is a terminal branch leading to s) or taking the minimum between 
and 
(if
contains subclades
and
). In Figure 2, the 
values are reported on the right of the solutions table.
Once all the
and
values have been derived in the solutions table, the solution implicitly found for the top right entry of the table is also a solution to problem (1) and it can be reconstructed using the 
values in the way we described above. However, there may be more than one optimal solution to problem (1), all equally good with respect to r PD but possibly involving different overall expenditures. It is not guaranteed that the solution reconstructed as described above is the cheapest among them. Although this is not required by the problem formulation (1), this is clearly a desirable property and easy to achieve: by looking at the solutions found for smaller sub-budgets, we can check if the budget can be reduced without affecting the optimal r PD. This corresponds to scanning the solutions table from its top right entry towards the left, until a reduction in 
X(b) is observed. The solution for the last (i.e., least) b with 
X(b) = 
X(B) is a minimal-cost r PD-optimal solution and can be reconstructed in the usual way. In some cases it may even be of interest to derive all (minimal-cost) optimal solutions, which could be achieved by storing mutiple 
(b) values and using all of them in the reconstruction at the end. However, we note that the number of solutions to reconstruct may grow exponentially in the size of the problem.
This concludes the description of our dynamic programming algorithm for maximizing r PD subject to cost constraints. A different description, less verbose and directly convertible into computer code, is given by the pseudocode in Algorithms 1, 2, and 3 in Appendix. Additionally, the dynamic programming algorithm could be modified to solve problem (1) for trees with branches of any length (i.e., where negative lengths are allowed), but for simplicity we do not describe this here.
Regarding the computational complexity of our algorithm, the calculation of a single entry in the solutions table requires O(b) = O(B) time, as all possible ways to split sub-budget b may need to be examined. This must be repeated for each of the (2n–1)(B+1) = O(nB) subproblems (where n = |
|), giving a total of O(nB2) operations for filling the solutions table. The reconstruction of an optimal solution for problem (1) from the top right entry only takes O(n) time, as it consists in a top-down traversal of all the 2n–1 = O(n) clades of
X, in which each clade can be dealt with in constant time. Therefore, the entire algorithm has time complexity O(nB2). Memory complexity is dominated by the size of the solutions table, and so is O(nB).
Finally, we note that problem (1) could also be formulated as an integer linear programming (ILP) problem (in a way analogous to that of Rodrigues and Gaston, 2002)and solved with standard off-the-shelf techniques. However, there are no guarantees that the running time of these algorithms would be better than exponential in n.
| The Unrooted Case |
|---|
|
|
|---|
We now turn to solving problem (1) with PD defined as u PD, which we call the unrooted problem. An example is given in Figure 3. At first glance, an obstacle to its solution seems to be that there is no evident way to break it into smaller problems: even if we solve the unrooted problem on portions of
|
The key observation here is that a solution to the unrooted problem, if not the empty set, is equal to {s}
R, where R is a solution to the rooted problem with budget B–cs applied to
{C,D,E}, where {C,D,E} is a solution to the rooted problem on
This allows us to reduce the unrooted problem to a number of related rooted problems: we could iteratively root
X in each of its taxa s and calculate a solution Rs to the rooted problem with budget B–cs (or skip s if cs > B); any of the subsets {s}
Rs with the largest u PD (or, equivalently, any of the ones with the largest rPD(Rs)) is a solution to the unrooted problem.
However, this procedure would involve repeating the r PD-maximization algorithm once for every taxon, thus requiring O(n2B2) time. A more efficient approach, which we describe in the Appendix, consists of extending our notion of clade so that it includes all the different taxon-rootings
s; as before, we solve the rooted subproblems (with many possible sub-budgets) for all the clades in
X. Because this now includes additional clades not present before, we devise a new ordering of the clades so that we can incrementally derive new solutions from the previously calculated ones. In the end, we compare the rooted solutions found for the various
s; any of the best ones will provide us with an optimal solution to the unrooted problem.
This more efficient algorithm for the unrooted problem is practically equivalent in computational complexity to the one in the last section. As described in the Appendix, there are now 2(2n–3) clades, and therefore still O(n) rows in the solutions table (and O(B) columns). Derivation of the solution for
and b from the other stored solutions and reconstruction of any of these solutions using the 
values are achieved in exactly the same way as before and therefore still take O(B) and O(n) time, respectively. Comparing rPD(Rs) for all the rooted solutions Rs in order to find the unrooted solution {s}
Rs also takes just O(n) time. Therefore, this algorithm has time complexity O(nB2) and memory complexity O(nB).
| Applications to the Noah's Ark Problem |
|---|
|
|
|---|
In the NAP (Weitzman, 1998) each taxon s

s for taxon s, the probability of survival will also (generally) increase as a function ps(
s). In the original formulation (Weitzman, 1998), ps grows linearly from an initial value as to a maximum probability bs, as the expenditure is increased from 0 to a cost cs. For this shape of ps(
s) (and others) optimal solutions to the NAP are "extreme"; i.e., for every taxon s (with the possible exception of one taxon), either nothing or the whole cs is spent on it (Weitzman, 1998). Following this, Hartmann and Steel (2006) have cast the NAP in a form (not entirely equivalent to the original NAP) analogous to that of problem (1):
|
| (2) |
S or not. It is easy to see that
|
| (3) |
Note that problem (1) is the particular case of problem (2) obtained by setting as = 0 and bs = 1 for all taxa. We will refer to problem (1) as the 0
1 NAP and to (2) as the as
bs NAP. This notation indicates the probabilities of survival without and with conservation (left and right of the arrow) and the conservation costs (above); these may be constants or taxon-dependent (indicated by variables indexed by taxon s).
There is in fact a hierarchy of subproblems of the as
bs NAP. The simplest of them is the 0
1 NAP, which can be solved with a greedy algorithm (Steel, 2005; Pardi and Goldman, 2005). There are other greedy-tractable regions in this hierarchy, notably the 0
1 NAP applied to an ultrametric tree and the 1–qs
1–
qs NAP, where for every taxon s the initial probability of extinction qs can be reduced by a constant factor
(0
1) by paying a constant price c (Hartmann and Steel, 2006).
We will show that other (larger) regions of this hierarchy can be solved with dynamic programming algorithms: our base algorithm for problem (1) already demonstrated this for the 0
1 NAP, and we now show that the same holds, more generally, for the as
1 NAP. This is the relatively realistic scenario whereby conservation projects with variable costs cs completely ensure survival of the species to which they are applied, which can have different initial risks of extinction (as depends on s). This may be the case when a taxon can be saved by simply saving a few of its individuals, be it in a zoo or in a prophet's ark. An example of as
1 NAP is given in Figure 4: panel (a) gives a tree
X for 52 Madagascar lemurs (including all those in the current IUCN Red List of Threatened Species, IUCN, 2006) and putative costs cs of conserving each of them, and panel (b) shows the probabilities of survival without conservation, as, and their effect on this problem, as explained below.
|
The key observation here is that, for any instance of the as
eq
Formally, define
' as the tree obtained from
by multiplying each branch length ta by a factor equal to
s
Ca(1–as). Then, the gain in
(rPD) due to conserving the taxa in any subset S can be simply derived from Equation (3):
|
|
It is interesting to reflect on the intuitive meaning of this transformation. The new tree is obtained by multiplying each branch by the probability that, as a result of extinctions, that branch will disappear from the tree connecting the surviving taxa. Therefore, we can think of this as a form of weighting that gives more importance to those parts in the tree that are more likely to get lost. For example, in Figure 4b, most of the internal branches in
' have a length that is very close to 0. This is because these branches have many taxa below them and are therefore unlikely to be lost. Contrast this with the internal branch leading to the two Vva subspecies: this branch is relatively long, as both Vva ru and Vva va are likely to become extinct.
Thanks to this transformation, a subset of taxa that maximizes
(rPD) on
subject to any given constraint also maximizes r PD on
' subject to that same constraint, and vice versa. If the constraint is simply a limit on the number of taxa, this means that any as
1 NAP can be solved by solving the corresponding 0
1 NAP on the transformed tree
', which can be done with a simple greedy algorithm (Steel, 2005; Pardi and Goldman, 2005). This result was also shown by Hartmann and Steel (2007), who independently derived the same branch length rescaling described above. The tractability of the as
1 NAP with a greedy algorithm also derives from its being a subproblem of the greedy-tractable 1–qs
1–
qs NAP (Hartmann and Steel, 2006).
More generally, if the constraint is of the cost-budget type, the above means that any as
1 NAP can be reduced to the corresponding 0
1 NAP on the transformed tree
', which can be solved with our base algorithm. Being able to solve the rather general as
1 NAP is a novel result, and we suspect that the applicability of dynamic programming techniques to the NAP is even wider.
Notice also that, even more generally, any as
bs NAP (problem 2) can be reduced to a NAP with the form 0
bs'. Constructing
' as before, a similar proof to that above shows that solving any as
bs NAP on
is equivalent to solving the corresponding 0
(bs – as)/(1 – as) NAP on
'. Even though we cannot use this reduction to solve the general as
bs NAP (as realistic algorithms to solve the 0
bs NAP are not currently known), this shows that all the difficulty of the general NAP is somehow already present in the 0
bs NAP.
It is important to realize that our dynamic programming algorithm cannot be applied or adapted to the general as
bs NAP (or the 0
bs NAP), as these problems do not have optimal substructure (it is not difficult to construct an example where the solution for a clade
is not obtainable as the union of solutions for
's subclades). One exception to this is the 0
b NAP, which does have optimal sustructure and therefore can be solved with a simple adaptation of our base dynamic programming algorithm. As the 0
b NAP can hardly have any practical importance, we do not describe this adaptation here, but this observation allows us to show that, interestingly, the class of (known) greedy-tractable subproblems of the NAP (Hartmann and Steel, 2006) is contained in the one tractable with dynamic programming algorithms: the 0
1 NAP applied to an ultrametric tree is a subproblem of the 0
1 NAP tout court, and any 1–qs
1–
qs NAP can be reduced (following our observation in the preceding paragraph) to a corresponding 0
1–
NAP, also solvable (as just mentioned) with dynamic programming.
Finally, note that problem (2) assumes a rooted definition of PD. Although we are not aware of work on the NAP using u PD instead of r PD (which we call u PD-NAP), it is natural to ask whether the techniques presented here (and in other papers such as Hartmann and Steel, 2006) can also be applied to the u PD-NAP. Using the ideas in this and the previous section, it is possible to solve any as
1 u PD-NAP. Its solution is either empty or equal to {s}
R, where R is a solution to the as
1 NAP on
s with budget B–cs (which we can solve as we just showed). Therefore, the problem can be solved by rooting the tree in each taxon in turn; for each rooting, transform
s in the way described before (this transformation is dependent on the position of the root, so we will get a different tree each time), and solve the resulting 0
1 NAP with budget B–cs. Denoting by Ri the solutions obtained when rooting in each taxon si, then a solution to the as
1 u PD-NAP is among {s1}
R1,{s2}
R2,...,{sn}
Rn and can be found by simply comparing their
(uPD). Note that the time complexity of this algorithm will now be O(n2B2), as the 0
1 NAP will have to be solved on n different trees.
| Discussion |
|---|
|
|
|---|
Firstly, we have presented a new formalization of the problem of selecting taxa to maximize the retained phylogenetic diversity. The basic advantage, compared to past formalizations (Steel, 2005; Pardi and Goldman, 2005), is the possibility of taking into account the different resources required by the different taxa, which we assume to be quantifiable into taxon-specific costs.
Secondly, we have given algorithms that solve this problem for two definitions of PD. These algorithms run in O(nB2) time and use O(nB) memory, where n is the number of taxa to be chosen from and B is the budget expressed as the number of units of a relevant resource (e.g., currency). The fact that the computational complexity depends on B may seem problematic: for example, if the underlying resource is money, the budget could be a very large number (of the order of the millions, if measured in common currencies) and O(nB2)-time algorithms would be unusable in practice. To avoid this, B and all costs should be preprocessed and expressed as multiples of a large unit (such as their greatest common divisor, efficiently found with a generalization of Euclid's algorithm; Cormen et al., 2001). For example, in the case of conservation projects with budgets of the order of millions of euros, we may express everything in multiples of 10,000. This is probably precise enough to satisfy the accountants, makes B of the order of hundreds and permits the algorithms to run quickly. In order to check their efficiency, we implemented our algorithms (C++ code available via http://www.ebi.ac.uk/goldman/rats) and found that for values of B of the order of the thousands, our program still only takes few seconds per clade (1.7-GHz Pentium processor with 512 MB of RAM), and up to few minutes per clade for B = 100,000. Note that these are the times needed to process one clade (i.e., fill its row in the solution table) and that they do not depend on the size of clades (because each nontrivial clade will require the same O(B2) operations to be processed). The total running time is obtained by multiplying the clade-processing time by the number of clades (2n–1 in the rooted case) and therefore depends on the number of taxa n but is independent of the tree shape.
Thirdly, we have shown that many cases of another formalization of taxon selection, the Noah's Ark Problem (NAP), can be transformed into instances of the problems solvable with our algorithms. We are currently working on the application of dynamic programming techniques to a form of the NAP that allows for more general relationships between conservation expenditure and probability of survival.
Although we realize that optimization problems such as the one formulated here (or even the NAP) are mainly of theoretical interest currently, it is still meaningful to ask how realistic and complete they are in including factors relevant to the selection of taxa. Many factors other than PD can be incorporated into problem (1), and the NAP, by adding to the length of each terminal branch a term quantifying the taxon's importance in relation to those factors (Steel, 2005; Pardi and Goldman, 2005)—this term coincides with what Weitzman (1998) calls the species' utility. Whereas for comparative genomics this approach seems capable of dealing with most selection criteria (Pardi and Goldman, 2005), for biodiversity conservation one of the most important factors, the actual probability of extinction, is much better dealt with by the NAP.
An important question that may be asked regarding the realism of our problem (and the NAP) is whether PD is a good guiding criterion for taxon selection. For comparative genomics, answering this question will involve investigating the relationship between uPD(S) and the statistical power (of tests on evolutionary processes) that results from comparing the sequences in S. Along the lines already considered by Eddy (2005) and McAuliffe et al. (2005), it will be interesting to tackle questions such as: Under which circumstances is PD maximization not the best way to ensure high statistical power? Do these circumstances ever arise in practice? Is it possible to define an alternative measure of sequencing worth that more closely reflects power? Are the answers to these questions dependent on which statistical test is going to be carried out?
As for conservation biology, we note that PD is not the only measure of conservation worth that has been proposed (reviewed by Crozier, 1997). A widespread characteristic of proposed measures is that, as May recommended in his seminal note (May, 1990), they give central importance to taxonomic (phylogenetic) relationships. Because they try to formalize the intuitive notion of diversity on a phylogenetic tree, many of them are mathematically related. Genetic diversity (GD) (Crozier, 1992, 1997) is defined as the probability that the set of taxa preserves more than one allele per site. A branch a in the tree is labeled with pa, the probability that an allele changes in the transition from one end of branch a to the other. Then GD(S) = 1–
a(1–pa), where the product is over all branches a that are preserved (i.e., all branches in the smallest unrooted tree connecting all taxa in S). GD turns out to be strictly related to u PD: if we assume that pa and the branch length ta are simply related through pa = 1–e–
ta, which is typical of sites where the number of alleles is so high that it is practically impossible to change back to a previously held state, then GD(S) = 1–
ae–
ta = 1–e–
ata = 1–e–
· uPD(S). Therefore, as GD grows monotonically with uPD, maximizing GD is equivalent to maximizing uPD. This result was suggested without proof by Crozier (1997).
Even species richnessGaston and Spicer, 1998 (the number of different species), which is probably still the most common measure of biodiversity for conservation, is mathematically related to PD: it is simply equivalent to r PD in an ultrametric star tree. Therefore, all the results presented here can be extended to the maximization of (expected) species richness (but the optimization problems are greatly simplified and better algorithms exist). The number of taxa conserved can also be used as a secondary criterion to discriminate among equally good choices of taxa (i.e., those with equal PD and total cost) and this (or even other criteria) can be easily incorporated in our algorithms by suitably setting 
(b) when multiple choices for this value produce the same r PD.
Furthermore, measures of diversity mathematically equivalent to PD are likely to arise in many other fields—e.g., ecology (Petchey and Gaston, 2002a, 2002b) and social sciences (Nehring and Puppe, 2002, 2003)—whenever a tree is a good way to represent relationships (often nonevolutionary) among the objects of study. For example, a tree ("dendrogram") is often the output of data clustering techniques (Everitt et al., 2001; Eisen et al., 1998). Our results may have even wider applicability than we have suggested here.
An important shortcoming of the NAP is that species are assumed to survive or become extinct independently from one another, whereas in reality strong interdependencies may exist (for example, between predators and prey; van der Heide et al., 2005; Witting and Loeschcke, 1995; Witting et al., 2000). These dependencies can in principle be formalized, but the resulting optimization problem is likely to be a very difficult one. In particular, we doubt that dynamic programming approches such as ours will be effective: in order for a problem to have optimal substructure, a degree of independence between its parts is needed. Heuristic approaches such as hill-climbing or evolutionary algorithms may be preferable.
Note that the assumption of independence may be justified for exsitu conservation—removal and protection of a small number of individuals from their environment does not affect the survival of other taxa—but clearly not for in situ conservation (van der Heide et al., 2005). In particular, when protection is applied to a geographical area rather than to a species or population, survival probabilities are raised simultaneously for the entire group of taxa that lives in that area. Interestingly, this leads quite naturally to a generalization of the NAP in which selection is applied to a number of potential nature reserves, each defining a set of changes in survival probablities. Leaving a more precise formulation to the Appendix, we may call this the Nature Reserve Problem (NRP). When each reserve only contributes towards the survival of a single taxon, the NRP reduces to the NAP.
The problem of selecting reserves with the aim of preserving biodiversity is a well-established research topic in biodiversity conservation (e.g.,] Diamond, 1975; Higgs and Usher, 1980; Pressey et al., 1993; Dobson et al., 1997; Ando et al., 1998; Howard et al., 1998; Margules and Pressey, 2000; Rodrigues et al., 2004), and recently there has been a lot of interest in defining and solving (often heuristically) optimization problems for reserve selection (e.g., Margules et al., 1988; Cocks and Baird, 1989; Underhill, 1994; Camm et al., 1996; Church et al., 1996; Csuti et al., 1997; Polasky et al., 2000; Cabeza and Moilanen, 2001; Rodrigues and Gaston, 2002; Önal and Briers, 2003). Many (if not most) of the past formalizations would simply become special cases of the NRP we propose: for example, the PD-maximization problem of Rodrigues and Gaston (2002) could be seen as the 0
0/1 NRP, and the expected species richness maximization problem of Polasky et al. (2000) as the 0
brs NRP applied to an ultrametric star tree (see Appendix for the definition of the notation used here). The NRP would thus unify the (arguably most successful) approaches from economics and biology to the problem of biodiversity conservation, thus becoming a fertile meeting ground for the exchange of new ideas between these two disciplines.
| Appendix |
|---|
|
|
|---|
The Algorithm for the Unrooted Case
Here, we define a clade as any rooted subtree
Although the collection of clades is now nonhierarchical, the fundamental properties of a single clade have not changed: a clade is still a rooted tree with one branch departing from its root and with its leaves being a subset of the leaves of
X. Therefore, the rooted subproblems defined before for all clades
and sub-budgets b
B remain well defined. We are particularly interested in solving the subproblems for the taxon-rooted clades
s and sub-budgets B–cs because, as previously observed, some of their solutions Rs provide us with solutions {s}
Rs to the unrooted problem.
All the rooted subproblems will again be implicitly solved by incrementally deriving 
(b) and 
(b) for all
and b. As we showed before, the calculation of these values is either straightforward or directly obtainable from the corresponding values for
's subclades. It is therefore necessary to tackle the clades in an order that guarantees that subclades are met before their "superclades." Whereas before this was trivially satisfied by a bottom-up traversal of all clades, now a slightly more complex approach is needed. First, a node in
X is arbitrarily chosen as top node of the tree (different choices of this node, which can be a leaf, lead to different orderings of the clades, but all produce the same final results). This determines a way to picture the tree (imagine it redrawn with the top node uppermost and all branches descending from there) and we can then classify clades into downward clades—those consisting of a branch and everything below it—and upward clades—the remaining ones. For example, in Figure 3 the top node is set to the position highlighted in pink; as a result, (A,B) is a downward clade and (C,(D,E)) an upward clade.
The
and
values are calculated for downward clades first, going in the usual bottom-up order. In Figure 3, where again we imagine that all the results are stored in a solutions table, this corresponds to filling the bottom half of the table, starting from the bottom row and going upwards. The results for the upward clades are then derived in a top-down fashion: more precisely, we can imagine that all the branches in
X are visited in a top-down order and each time a branch is visited, the
and
values for the corresponding upward clade are computed. In Figure 3, upward clades are visited in the following order: 1: (C,(D,E)); 2: (B,(C,(D,E))); 3: (A,(C,(D,E))); 4: ((A,B),(D,E)); 5: ((A,B),C); 6: (((A,B),C),E); 7: (((A,B),C),D), which corresponds to filling the top half of the solutions table from bottom to top. This ordering of the clades guarantees that whenever we derive solutions for a clade
containing subclades
and
, the solutions for
and
have already been calculated.
Once all the
and
values have been calculated, we turn our attention to the values 
s(B–cs) for all taxa s. By definition, these are equal to rPD(Rs), where Rs is a solution to the rooted problem with budget B–cs applied to
s, and therefore also equal to the u PD of the candidate solutions {s}
Rs to the unrooted problem. Clearly, the taxa s that maximize 
s(B–cs) are those contained in an optimal solution to the unrooted problem (in Figure 3, all of them except B). If we pick any one of these taxa s and reconstruct Rs by using the 
values starting from 
s(B–cs)—as described for the rooted case or, equivalently, with a call to reconstruct(
s,B–cs)—we therefore also obtain an optimal solution {s}
Rs to the unrooted problem. For example, if in Figure 3 we pick taxon C, following the 
values (indicated in the figure by dotted arrows) leads to rooted solution {A,D,E} and therefore {A,C,D,E} is a solution to the unrooted problem.
Note that whereas in this example {A,C,D,E} coincides with the set of taxa that maximize 
s(B–cs), in general the latter will not necessarily coincide with an optimal solution but rather with the union of all optimal solutions to the unrooted problem.
This concludes the description of our algorithm for maximizing u PD subject to cost constraints. Again, a more concise description is given by the pseudocode in algorithms 1–5: a solution to the unrooted problem is simply obtained by a call to double traversal(
). The cheapest optimal solution can be obtained in a similar way to that described above: just try to reduce the budget below B–cs for all taxa s that maximize 
s(B–cs); one of the taxa that lead to the largest reduction should be used as starting point for the reconstruction of the minimal-cost u PD-optimal set of taxa.
Pseudocode
Problem (1) with the rooted definition for PD is solved by a call to bottom-up(
X), which fills up the solutions table, followed by a call to reconstruct(
X,B), which returns an optimal solution (not necessarily of minimum cost, but this can easily be implemented).
Problem (1) with the unrooted definition for PD is solved by a call to double traversal(
) (where rev(
) denotes the clade rooted in the same branch as
but oriented towards the opposite side).
A simple improvement to these algorithms can be obtained by noting that the rows of the solutions table do not always need to be filled up until their last column. If a clade
is such that the total sum of the costs for its taxa, which we may denote by C
, is smaller than B, then its 
(b) and 
(b) values need only be calculated for b
C
: the solutions for b > C
are necessarily the same as the one for b = C
and simply consist of taking all taxa in
. Procedures calculate(
) and reconstruct(
,b) can be simply modified accordingly. This results in some saving of running time, although the time complexity remains O(nB2).
|
|
|
|
|
The Nature Reserve Problem
Imagine having, in addition to the species in



eq
r
Rcr
B, that maximizes the expected PD resulting from conservation of the sites in R.
The Nature Reserve Problem, which we may write ars
brs NRP, is a generalization of many problems defined in the past. In addition to the ones already mentioned, we note that the Budgeted Maximum Coverage Problem (Khuller et al., 1999) coincides with the 0
0/1 NRP applied to a star tree—where by writing 0/1 on the right of the arrow we mean that each survival probability brs is constrained to equal 0 or 1.
Taxa in Figure 4
Species abbreviations are Dma: Daubentonia madagascariensis; Mmy: Microcebus myoxinus; Mru: Microcebus rufus; Mra: Microcebus ravelobensis; Mmu: Microcebus murinus; Mco: Mirza coquereli; Atr: Allocebus trichotis; Cma: Cheirogaleus major; Cme: Cheirogaleus medius; Pfu el: Phaner furcifer electromontis; Pfu pl: Phaner furcifer pallescens; Pfu pr: Phaner furcifer parienti; Pfu fu: Phaner furcifer furcifer; Lmu: Lepilemur mustelinus; Ldo: Lepilemur dorsalis; Lse: Lepilemur septentrionalis; Led: Lepilemur edwardsi; Lmi: Lepilemur microdon; Lru: Lepilemur ruficaudatus; Lle: Lepilemur leucopus; Pta: Propithecus tattersalli; Pco: Propithecus coquereli; Pve ve: Propithecus verreauxi verreauxi; Pve de: Propithecus verreauxi deckeni; Pve co: Propithecus verreauxi coronatus; Pdi: Propithecus diadema; Ppe: Propithecus perrieri; Pca: Propithecus candidus; Ped: Propithecus edwardsi; Iin: Indri indri; Ala: Avahi laniger; Aoc: Avahi occidentalis; Acl: Avahi cleesei; Vva va: Varecia variegata variegata; Vva ru: Varecia variegata rubra; Hgr al: Hapalemur griseus alaotrensis; Hgr oc: Hapalemur griseus occidentalis; Hgr gr: Hapalemur griseus griseus; Hau: Hapalemur aureus; Hsi: Hapalemur simus; Lca: Lemur catta; Ema ma: Eulemur macaco macaco; Ema fl: Eulemur macaco flavifrons; Eco: Eulemur coronatus; Eru: Eulemur rubriventer; Emo: Eulemur mongoz; Efu co: Eulemur fulvus collaris; Efu ac: Eulemur fulvus albocollaris; Efu ru: Eulemur fulvus rufus; Efu fu: Eulemur fulvus fulvus; Efu af: Eulemur fulvus albifrons; Efu sa: Eulemur fulvus sanfordi. The phylogenetic tree used in Figure 4a and the species' conservation costs and survival probabilities are available from http://www.ebi.ac.uk/goldman/ rats.
| Acknowledgments |
|---|
|
|
|---|
We thank Klaas Hartmann, David MacKay, and two anonymous reviewers for their many helpful comments. F.P. is a member of St Catharine's College, University of Cambridge.
| References |
|---|
|
|
|---|
-
Ando A., Camm J., Polasky S., Solow A. Species distributions, land values, and efficient conservation. Science (1998) 279:2126–2128.
Andriaholinirina N., Fausser J., Roos C., Zinner D., Thalmann U., Rabarivola C., Ravoarimanana I., Ganzhorn J., Meier B., Hilgartner R., Walter L., Zaramody A., Langer C., Hahn T., Zimmermann E., Radespiel U., Craul M., Tomiuk J., Tattersall I., Rumpler Y. Molecular phylogeny and taxonomic revision of the sportive lemurs (Lepilemur, Primates). BMC Evol. Biol. (2006) 6:17.[CrossRef][Medline]
Barker G. Phylogenetic diversity: A quantitative framework for measurement of priority and achievement in biodiversity conservation. Biol. Linn. Soci. (2002) 76:165–194.[CrossRef]
Cabeza M., Moilanen A. Design of reserve networks and the persistence of biodiversity. Trends Ecol. Evol. (2001) 16:242–248.[CrossRef][Medline]
Camm J., Polasky S., Solow A., Csuti B. A note on optimal algorithms for reserve site selection. Biol. Conserv. (1996) 78:353–355.[CrossRef]
Church R., Stoms D., Davis F. Reserve selection as a maximal covering location problem. Biol. Conserv. (1996) 76:105–112.[CrossRef]
Cocks K., Baird I. Using mathematical programming to address the multiple reserve selection problem: An example from the Eyre Peninsula, South Australia. Biol. Conserv. (1989) 49:113–130.[CrossRef]
Cormen T., Leiserson C., Rivest R., Stein C. Introduction to algorithms (2001) 2nd edition. Cambridge, Massachusetts: MIT Press.
Crozier R. Genetic diversity and the agony of choice. Biol. Conserv. (1992) 61:11–15.[CrossRef]
Crozier R. Preserving the information content of species: Genetic diversity, phylogeny, and conservation worth. Annu. Rev. Ecol. Syst. (1997) 28:243–268.[CrossRef][Web of Science]
Crozier R., Agapow P., Dunnett L. Conceptual issues in phylogeny and conservation: A reply to Faith and Baker. Evol. Bioinformatics Online (2006) 2:197–199.
Csuti B., Polasky S., Williams P., Pressey R., Camm J., Kershaw M., Kiester A., Downs B., Hamilton R., Huso M., Sahr K. A comparison of reserve selection algorithms using data on terrestrial vertebrates in Oregon. Biol. Conserv. (1997) 80:83–97.[CrossRef]
Diamond J. The island dilemma: lessons of modern biogeographic studies for the design of natural reserves. Biol. Conserv. (1975) 7:129–146.[CrossRef]
Dobson A., Rodriguez J., Roberts W., Wilcove D. Geographic distribution of endangered species in the United States. Science (1997) 275:550–553.
Eddy S. A model of the statistical power of comparative genome sequence analysis. PLoS Biol. (2005) 3:e10.[CrossRef][Medline]
Eisen M., Spellman P., Brown P., Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. USA (1998) 95:14863–14868.
Everitt B., Landau S., Leese M. Cluster analysis (2001) 4th edition. London: Arnold.
Faith D. Conservation evaluation and phylogenetic diversity. Biol. Conserv. (1992) 61:1–10.[CrossRef]
Faith D., Baker A. Phylogenetic diversity (PD) and biodiversity conservation: Some bioinformatics challenges. Evol. Bioinformatics Online (2006) 2:70–77.
Forest F., Grenyer R., Rouget M., Davies T., Cowling R., Faith D., Balmford A., Manning J., Proche
., van der Bank M., Reeves G., Hedderson T., Savolainen V. Preserving the evolutionary potential of floras in biodiversity hotspots. Nature (2007) 445:757–760.[CrossRef][Medline]
Garey M., Johnson D. Computers and intractability: A guide to the theory of NP-completeness (1979) New York: W.H. Freeman and Co.
Gaston K., Spicer J. Biodiversity: An introduction (1998) Oxford, UK: Blackwell Science.
Hartmann K., Steel M. Maximizing phylogenetic diversity in biodiversity conservation: Greedy solutions to the Noah's Ark Problem. Syst. Biol. (2006) 55:644–651.
Hartmann K., Steel M. Phylogenetic diversity: From combinatorics to ecology. In: Reconstructing evolution: New mathematical and computational advances—Gascuel O., Steel M., eds. (2007) Oxford University Press.
Higgs A., Usher M. Should nature reserves be large or small? Nature (1980) 285:568–569.[CrossRef][Web of Science]
Howard P., Viskanic P., Davenport T., Kigenyi F., Baltzer M., Dickinson C., Lwanga J., Matthews R., Balmford A. Complementarity and the use of indicator groups for reserve selection in Uganda. Nature (1998) 394:472–475.[CrossRef][Web of Science]
IUCN. 2006 IUCN Red List of Threatened Species (2006) http://www.iucnredlist.org(accessed 30 December 2006).
Khuller S., Moss A., Naor J. The budgeted maximum coverage problem. Inform. Process. Lett. (1999) 70:39–45.[CrossRef]
Mace G., Gittleman J., Purvis A. Preserving the Tree of Life. Science (2003) 300:1707.
Margules C., Nicholls A., Pressey R. Selecting networks of reserves to maximize biological diversity. Biol. Conserv. (1988) 43:63–76.[CrossRef]
Margules C., Pressey R. Systematic conservation planning. Nature (2000) 405:243–253.[CrossRef][Medline]
Margulies E., Vinson J., Miller W., Ja–e D., Lindblad-Toh K., Chang J., Green E., Lander E., Mullikin J., Clamp M. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Nat. Acad. Sci. USA (2005) 102:4795–4800.
May R. Taxonomy as destiny. Nature (1990) 347:129–130.[CrossRef][Web of Science]
McAuliffe J., Jordan M., Pachter L. Subtree power analysis and species selection for comparative genomics. Proc. Nat. Acad. Sci. USA (2005) 102:7900–7905.
Minh B., Klaere S., von Haeseler A. Phylogenetic diversity within seconds. Syst. Biol. (2006) 55:769–773.
Moritz C., Faith D. Comparative phylogeography and the identification of genetically divergent areas for conservation. Mol. Ecol. (1998) 7:419–429.[CrossRef]
Nee S., May R. Extinction and the loss of evolutionary history. Science (1997) 278:692–694.
Nehring K., Puppe C. A theory of diversity. Econometrica (2002) 70:1155–1198.[CrossRef][Web of Science]
Nehring K., Puppe C. Diversity and dissimilarity in lines and hierarchies. Math. Social Sci. (2003) 45:167–183.[CrossRef]
Önal H., Briers R. Selection of a minimum-boundary reserve network using integer programming. Proc. R. Soc. B Biol. Sci. (2003) 270:1487–1491.[CrossRef][Medline]
Pardi F., Goldman N. Species choice for comparative genomics: No need for cooperation. PLoS Genetics (2005) 1:e71.[CrossRef]
Pastorini J., Forstner M., Martin R. Phylogenetic relationships among Lemuridae (Primates): Evidence from mtDNA. J. Hum. Evol. (2002) 43:463–478.[Web of Science][Medline]
Pastorini J., Martin R., Ehresmann P., Zimmermann E., Forstner M. Molecular phylogeny of the lemur family Cheirogaleidae (Primates) based on mitochondrial DNA sequences. Mol. Phylogenet. Evol. (2001) 19:45–56.[CrossRef][Web of Science][Medline]
Petchey O., Gaston K. Extinction and the loss of functional diversity. Proc. R. Soc. B Biol. Sci. (2002a) 269:1721–1727.[CrossRef][Medline]
Petchey O., Gaston K. Functional diversity (FD), species richness and community composition. Ecol. Lett. (2002b) 5:402–411.[CrossRef][Web of Science]
Polasky S., Camm J., Solow A., Csuti B., White D., Ding R. Choosing reserve networks with incomplete species information. Biol. Conserva. (2000) 94:1–10.[CrossRef]
Pressey R., Humphries C., Margules C., Vane-Wright R., Williams P. Beyond opportunism: Key principles for systematic reserve selection. Trends Ecol. Evol. (1993) 8:124–128.[CrossRef]
Reist-Marti S., Simianer H., Gibson J., Hanotte O., Rege J. Weitzman's approach and conservation of breed diversity: An application to African cattle breeds. Conserv. Biol. (2003) 17:1299–1311.[CrossRef][Web of Science]
Rodrigues A., Andelman S., Bakarr M., Boitani L., Brooks T., Cowling R., Fishpool L., da Fonseca G., Gaston K., Hoffmann M., Long J. S., Marquet P. A., Pilgrim J. D., Pressey R. L., Schipper J., Sechrest W., Stuart S. N., Underhill L. G., Waller R. W., Watts M. E. J., Yan X. Effectiveness of the global protected area network in representing species diversity. Nature (2004) 428:640–643.[CrossRef][Medline]
Rodrigues A., Gaston K. Maximizing phylogenetic diversity in the selection of networks of conservation areas. Biol. Conserv. (2002) 105:103–111.[CrossRef]
Roos C., Schmitz J., Zischler H. Primate jumping genes elucidate strepsirrhine phylogeny. Proc. Nat. Acad. Sci. USA (2004) 101:10650–10654.
Simianer H., Marti S., Gibson J., Hanotte O., Rege J. An approach to the optimal allocation of conservation funds to minimize loss of genetic diversity between livestock breeds. Ecol. Econ. (2003) 45:377–392.[CrossRef]
Steel M. Phylogenetic diversity and the greedy algorithm. Syst. Biol. (2005) 54:527–529.
Thomas J., Touchman J., Blakesley R., Bouffard G., Beckstrom-Sternberg S., Margulies E., Blanchette M., Siepel A., Thomas P., McDowell J., et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature (2003) 424:788–793.[CrossRef][Medline]
Underhill L. Optimal and suboptimal reserve selection algorithms. Biol. Conserv. (1994) 70:85–87.[CrossRef]
van der Heide C., van den Bergh J., van Ierland E. Extending Weitzman's economic ranking of biodiversity protection: Combining ecological and genetic considerations. Ecol. Econ. (2005) 55:218–223.[CrossRef]
Vane-Wright R., Humphries C., Williams P. What to protect?—Systematics and the agony of choice. Biol. Conserv. (1991) 55:235–254.[CrossRef]
Weitzman M. On diversity. Q. J. Econ. (1992) 107:363–405.[CrossRef][Web of Science]
Weitzman M. What to preserve? An application of diversity theory to crane conservation. Q. J. Econ. (1993) 108:157–183.[CrossRef][Web of Science]
Weitzman M. The Noah's Ark Problem. Econometrica (1998) 66:1279–1298.[CrossRef][Web of Science]
Witting L., Loeschcke V. Biodiversity conservation: Reserve optimization or loss minimization? Trends Ecol. Evol. (1993) 8:417.[CrossRef]
Witting L., Loeschcke V. The optimization of biodiversity conservation. Biol. Conserv. (1995) 71:205–207.[CrossRef]
Witting L., Tomiuk J., Loeschcke V. Modelling the optimal conservation of interacting species. Ecol. Model. (2000) 125:123–143.[CrossRef][Web of Science]
Yoder A. Back to the future: A synthesis of strepsirrhine systematics. Evol. Anthropol. Issues News Rev. (1997) 6:11–22.[CrossRef]
Yoder A., Rasoloarison R., Goodman S., Irwin J., Atsalis S., Ravosa M., Ganzhorn J. Remarkable species diversity in Malagasy mouse lemurs (primates, Microcebus). Proc. Nat. Acad. Sci. USA (2000) 97:11325–11330.
This article has been cited by other articles:
![]() |
B. Q. Minh, S. Klaere, and A. von Haeseler Taxon Selection under Split Diversity Syst Biol, December 1, 2009; 58(6): 586 - 594. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||














