Skip Navigation

Systematic Biology 2007 56(3):431-444; doi:10.1080/10635150701411279
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Pardi, F.
Right arrow Articles by Goldman, N.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Pardi, F.
Right arrow Articles by Goldman, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 Society of Systematic Biologists

Resource-Aware Taxon Selection for Maximizing Phylogenetic Diversity

Edited by Mike Steel: Associate Editor

Fabio Pardi and Nick Goldman

EMBL—European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK E-mail: pardi{at}ebi.ac.uk (F.P.)


    Abstract
 Top
 Abstract
 A Dynamic Programming Algorithm...
 The Unrooted Case
 Applications to the Noah's...
 Discussion
 Appendix
 Acknowledgments
 References
 
Phylogenetic diversity (PD) is a useful metric for selecting taxa in a range of biological applications, for example, bioconservation and genomics, where the selection is usually constrained by the limited availability of resources. We formalize taxon selection as a conceptually simple optimization problem, aiming to maximize PD subject to resource constraints. This allows us to take into account the different amounts of resources required by the different taxa. Although this is a computationally difficult problem, we present a dynamic programming algorithm that solves it in pseudo-polynomial time. Our algorithm can also solve many instances of the Noah's Ark Problem, a more realistic formulation of taxon selection for biodiversity conservation that allows for taxon-specific extinction risks. These instances extend the set of problems for which solutions are available beyond previously known greedy-tractable cases. Finally, we discuss the relevance of our results to real-life scenarios.

Keywords: Biodiversity conservation; comparative genomics; dynamic programming; phylogenetic diversity; Noah's Ark Problem; species choice; taxon selection

Received September 21, 2006; Revised November 16, 2006; Accepted December 29, 2006


The phylogenetic diversity (PD) of a set of taxa (Faith, 1992), (roughly) the same amount of resources. In reality,loosely defined as the total length of the evolutionary tree connecting them, is a measure of increasing importance in a number of areas in biology, in particular biodiversity conservation (Crozier, 1997; Barker, 2002; Mace et al., 2003; Forest et al., 2007) and comparative genomics (see Pardi and Goldman, 2005, and references therein). In bioconservation, the protection from extinction of species or populations that have a large total PD or, equivalently, that represent a large portion of evolutionary history: Nee and May, 1997) is regarded as a good way to preserve genetic diversity (Crozier, 1997) and, more generally, diversity in the biological features of the existing organisms (Faith, 1992). In genomics, on the other hand, comparing homologous sequences with a high total divergence (which is to sequences what PD is to taxa) allows testing of evolutionary hypotheses and detection of various genomic features with high statistical power (Thomas et al., 2003; Eddy, 2005; Pardi and Goldman, 2005). Comparative genomics, like biological conservation, is therefore better done on phylogenetically diverse sets of genes (or individuals, populations or species: the unit is not important and we will use the word taxa throughout). Genome sequencing projects are increasingly aware of this (Margulies et al., 2005).

Often it will not be possible to save every existing species from large extinction events (such as the one we are currently causing) or to sequence every available genome. It is therefore natural to ask how choices should be made in these areas—"the agony of choice" is a popular phrase in bioconservation (Vane-Wright et al., 1991; Crozier, 1992). TX their (possibly rooted). Because PD has come to be regarded as a good metric for measuring taxon importance, there has recently been interest in the formal optimization problems associated with it: how should we select taxa in order to maximize the resulting PD? The simplest of these problems, consisting of selecting a given number (decided a priori) of taxa, is easily solved by a greedy strategy (Steel, 2005; Pardi and Goldman, 2005). However, this scenario assumes that it is feasible to determine in advance the number of taxa that will be dealt with by the available resources, which is only true if the taxa require (roughly) the same amount of resources. In reality, this is usually not true: in bioconservation, for example, we may be designing a protected geographical region of fixed area, and different species will typically require different amounts of land; similarly, in the case of sequencing, different genomes have different sizes and therefore will have different costs, not only in terms of money but also of time and instruments required for sequencing. Potentially, a choice will have to be made between selecting few "expensive" taxa or many "cheap" ones.

Assuming that "costs" (whatever the available resource, e.g., money, time, labor, machinery, space, etc.) can be roughly quantified, a possible approach to limit "expenditure" could consist of modifying the evolutionary tree used as a basis for calculating PD by shortening each terminal branch by an amount depending on the cost of its taxon, so that the selection of costly taxa would be discouraged (Steel, 2005; Pardi and Goldman, 2005). This is not very satisfying; for example, there is no guarantee that the selected taxa will have maximum PD among all the other choices of taxa with the same total cost.

Here, we adopt a more direct approach: given taxon-associated costs and an estimate of the available resources, which we naturally call the budget, we aim to select the set of taxa of maximum PD among those sets with total cost at most equal to the budget. This can be re-expressed using some formalisms that will be useful throughout this paper. Let Formula be the chosen phylogenetic scope (Pardi and Goldman, 2005), i.e., the set of taxa we aim to select from, and Formula X their (possibly rooted) phylogenetic tree, where all branches have non-negative lengths. Each taxon (e.g., species or sequence) sisinFormula has a nonnegative integer cost cs. We aim to


Formula 1

(1)
where B is an integer representing the budget.

Although PD(S) has been simply defined as the total length of the smallest subtree of Formula X connecting all taxa in S (the "minimum spanning path" of Faith, 1992), there have been different interpretations of this definition in the literature (see, for example, the discussion in Faith and Baker, 2006; Crozier et al., 2006). These probably derive from different understandings of what a subtree is: if Formula X is a rooted tree, do its subtrees necessarily include its root? For example, does the smallest subtree of the mammals that contains human and chimp necessarily contain also the common ancestor of all mammals? It is clear that different ways of answering this question lead to different values of PD and sometimes to different solutions to the optimization problem (1). It is therefore useful to distinguish between the two possible definitions of PD. We will call the unrooted phylogenetic diversity (u PD) of a set of taxa SsubeqFormula the total length of the smallest unrooted subtree of Formula X connecting all the taxa in S but not necessarily the root (assuming there is one). Conversely, when Formula X is rooted, the total length of the smallest subtree of Formula X having the same root as Formula X and connecting all the taxa in S will be referred to as the rooted phylogenetic diversity (r PD) of S (see Fig. 1). Clearly, denoting the root of Formula X by {rho} (and assuming {rho}isinFormula ), we have that rPD(S) = uPD(S{cup}{{rho}}). In practice, the choice between these two measures will be determined by the intended application: u PD seems more appropriate for comparative genomics (as most sequence comparison techniques are independent of root placement), whereas r PD seems to be preferred in conservation biology (Rodrigues and Gaston, 2002). Furthermore, as we will show in the following, the optimization problems associated with these two metrics can be of rather different difficulty, with u PD generally posing more problems than r PD. Insisting on this distinction is therefore not merely a pedantic exercise.


Figure 1
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 1 Difference between two definitions of PD. Highlighted are the branches whose lengths are summed in order to give r PD(S) (left) and u PD(S) (right). Assuming branch lengths are proportional to those in the figure, S is an optimal subset of size 2 for r PD but not u PD (u PD-optimal subsets of size 2 consist of the taxon on the left of the root, plus any other taxon). This shows that the solution of problem (1) is dependent on the definition of PD.

 
Incidentally, this ambiguity in the notion of PD is not surprising if we consider that Faith's paper introducing PD (Faith, 1992) was itself somewhat ambiguous: whereas one of the formal results (Faith, 1992: 4, last line) only holds for u PD, the only example that allows discrimination between the two interpretations (Faith, 1992: fig. 3a, set R3), is consistent with r PD but not u PD. In later papers (e.g., Moritz and Faith, 1998; Faith and Baker, 2006), Faith and colleagues clarified their preference for r PD, but the alternative definition for PD has become widespread (e.g., Crozier, 1997; Barker, 2002; Steel, 2005; Minh et al., 2006) and other authors even refer to r PD as evolutionary history (Nee and May, 1997).

One way of solving problem (1) would be to calculate the cost and PD for all 2n subsets of the taxa, where n = |Formula | is the number of taxa being selected among. However, this is not feasible even for moderate values of n, and therefore alternative approaches must be sought. The next two sections give novel algorithms that efficiently solve problem (1) for both definitions of PD. Before then, however, some considerations are needed. First, these algorithms will assume that Formula X is a bifurcating tree, which clearly is not a limitation, as every multifurcating tree can be resolved into an equivalent bifurcating tree in which the new (internal) branches have length 0. Second, note that the formulation (1) can also be used to express the scenario whereby a number of taxa have already been conserved (or sequenced): the search for a maximally diverse extension (Pardi and Goldman, 2005) of an initial set IsubFormula can be implemented by solving problem (1) where all leaves in I have their cost set to 0. Third, note that all costs and the budget are (implicitly in the following) assumed to be integers. As will be shown, the algorithms' running times grow quadratically with the budget B. It is therefore important to express all costs and budget as multiples of large units to limit the value of B and, consequently, running times. This granularity is not unrealistic, since real-life resources are inherently discrete (e.g., currencies) and costs and budgets will often be known not with great precision but only approximately.

Compared to the problems for which an efficient solution is known (Steel, 2005; Pardi and Goldman, 2005), problem (1) is clearly a step towards realism. In the context of biodiversity conservation, however, there is an important factor that is not taken into account: different taxa have different risks of extinction (Witting and Loeschcke, 1993, 1995). For example, it is clear that if a species is not endangered at all, then expending resources to conserve this or a very closely related species is not a good choice. It is intuitive that conservation efforts should concentrate on the most endangered species, although under some particular conditions the contrary seems to be true (Weitzman, 1998). These aspects have been formalized by Weitzman (1992, 1993, 1998) into the suggestively named "Noah's Ark Problem" (NAP) (Weitzman, 1998), which is receiving increasing attention especially in environmental and ecological economics (Simianer et al., 2003; Reist-Marti et al., 2003; van der Heide et al., 2005). Although Weitzman (1998) derived a myopic, or greedy, ranking criterion that assigns each taxon a conservation priority score, an exact algorithmic solution to the NAP remains an open problem.

Hartmann and Steel (2006) have shown that some special cases of the NAP can be solved, again, with a greedy algorithm. The algorithms we present here allow the resolution of further relatively general cases of the NAP, extending the scenarios considered by Hartmann and Steel. Although the main focus of this paper is on results that can be used in more than just biodiversity conservation, we will illustrate the consequences of our work for the NAP in Applications to the Noah's Ark Problem.

Finally, we note that problem (1) (and the NAP we define below, which generalizes it) is NP-hard, as the knapsack problem, another well-known NP-hard problem (Cormen et al., 2001), is simply its special case for star-shaped trees (Hartmann and Steel, 2006). Technically, our algorithms are examples of pseudo-polynomial time algorithms: although they run in time polynomial in B, this number may itself grow exponentially in the size of the input (see Garey and Johnson, 1979, for further discussion).


    A Dynamic Programming Algorithm for the Rooted Case
 Top
 Abstract
 A Dynamic Programming Algorithm...
 The Unrooted Case
 Applications to the Noah's...
 Discussion
 Appendix
 Acknowledgments
 References
 
When phylogenetic diversity is defined as r PD (as we implicitly assume throughout this section), problem (1) can be solved with a dynamic programming algorithm, which we will describe in this section. In the next section we will show that this algorithm can be extended to solve the unrooted (u PD) version of problem (1).

The key observation that allows the use of a dynamic programming algorithm is that optimal solutions to problem (1) can be simply decomposed into optimal solutions to its subproblems—technically, problem (1) is said to have optimal substructure (Cormen et al., 2001). As a consequence, optimal solutions to problem (1) can be constructed by first tackling and solving its subproblems and then combining the optimal solutions thus found.

In order to see what this means in practice, we need to introduce some concepts that allow us to apply problem (1) to smaller portions of Formula X. We define a clade as any subtree of Formula X consisting of a branch a and everything else below a in Formula X (note that we imagine Formula X with its root at the top); a clade should be thought of as rooted at the top of its top branch a. In a bifurcating tree, every clade is composed of a root branch and, possibly, by two other clades, which we call its subclades. (For example, in Fig. 2, left, clade Formula has two subclades, Formula and Formula , whereas Formula is composed of one terminal branch and no subclades.)


Figure 2
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 2 On the left, an instance of problem (1) (rooted case). Branch lengths are indicated by the numbers to the side of the branches. Taxon costs are in the boxes next to the leaves (green) and the budget is B = 8. The following clades of Figure 2X are highlighted: Figure 2 (broad grey branches), Figure 2 (red branches), Figure 2 (blue branches). On the right, the corresponding solutions table (the content of some rows is omitted for clarity). The correspondence between clades and rows is indicated by dotted lines (colored in the cases of Figure 2, Figure 2, and Figure 2). Rows are ordered according to a top-down visit of all clades of Figure 2X: 1: ((A,B),(C,(D,E))); 2: (A,B); 3: A; 4: B; 5: (C,(D,E)); 6: C; 7: (D,E); 8: D; 9: E. To the right of the table are cFigure 2 values and arrows indicating the dependencies among the rows. Column headings 0 – 8 indicate sub-budgets b. The top right corner (shaded orange) of the solutions table indicates the optimal r PD (11, achieved by selecting taxa {A, C, D, E}). A naïve greedy algorithm—consisting of always selecting the taxon that adds most PD among the ones that can be selected with the currently available budget—would not work in this instance: after selecting D and A (at a cost of 3 + 1 = 4), the remainder of the budget (B–4 = 4) would be used on B, leading to a total r PD of 10.5. Also note that another greedy algorithm—consisting of always selecting the taxon with the highest ratio between added r PD and cost cs (Hartmann and Steel, 2006)—would work in this instance but fails on the problem with budget B = 4; whereas the (unique) optimal solution for this budget is {A, D}, with an r PD of 9 (yellow cell), this greedy algorithm would initially select A and then E, thus including a taxon that is not part of the optimal solution.

 
A way to decompose problem (1) into smaller subproblems is to ask what is the best way to invest a given part of the budget into a given clade (i.e., which of the taxa in this clade should be selected given that part of the budget). It turns out that we can answer this question for all clades Formula and sub-budgets b ≤ B incrementally, starting from the clades consisting of only a terminal branch and then using the solutions already found to construct the solutions for larger clades.

Formally, the subproblems consist of finding, for every bisin{0,1,...,B} and every clade Formula , a subset S of the taxa in Formula that maximizes rPDFormula (S) subject to {sum}sisin Scs ≤ b. Here rPDFormula (S) denotes the rooted PD calculated as if Formula was the whole tree (e.g., in Fig. 2, rPDFormula ({E}) = 3.0). S will be called an (optimal) solution for Formula and b. Note that one of these subproblems (the one with Formula = Formula X and b = B) simply coincides with problem (1) itself (Formula X can always be seen as having a root branch, possibly of zero length, and therefore is a clade itself). Optimal solutions to these subproblems (or more precisely some sufficient information about them, as we will see later) can be stored in a table (the solutions table) whose rows correspond to all the different clades and whose columns correspond to all the sub-budgets 0,1,...,B (see Fig. 2). Clearly, position (Formula ,b) will contain (information about) an optimal solution for Formula and b. We will show that this table can be completed one row at a time, starting from the bottom and going up.

First, the solutions for the clades consisting of only a terminal branch (leading to, say, taxon s) are simply either the empty set {emptyset} or {s}, depending on whether the taxon is too expensive to be taken with the available sub-budget (i.e., cs > b), or not (cs ≤ b), respectively. Therefore the rows of the solutions table corresponding to terminal branches (in Fig. 2, the 3rd, 4th, 6th, and the last two) can be filled without looking at the content of any other row.

Second, when instead a clade Formula contains two subclades (Formula and Formula ), an optimal solution S for Formula and sub-budget b will simply amount to the union S = SFormula {cup} SFormula of two optimal solutions for two other subproblems: SFormula will be optimal for Formula and some sub-budget i ≤ b, whereas SFormula will be optimal for Formula and bi. (The reason for this is simple: S is naturally partitioned into SFormula and SFormula , containing the taxa in Formula and Formula , respectively; calling i the total cost of SFormula , if either SFormula were not optimal for Formula and i or SFormula were not optimal for Formula and bi, then we could replace this suboptimal choice of taxa with a better one and therefore improve also S, but this would contradict the fact that S is optimal.) For example, in Figure 2, an optimal solution for Formula and b = 4 is {C,E}, where {C} is optimal for Formula and sub-budget 2, and {E} is optimal for Formula and 2.

Importantly, if we have already calculated and stored optimal solutions for Formula and Formula and for all sub-budgets, it becomes possible to find a solution for Formula and any given b: denoting by SFormula (i) the solution stored for Formula and i (and similarly for SFormula (j)), just compare all the subsets SFormula (0){cup} SFormula (b), SFormula (1){cup} SFormula (b–1), ..., SFormula (b){cup} SFormula (0) and take the one with the largest r PD. This is guaranteed to find an optimal solution for Formula and b, because the possibility of decomposing an optimal solution S into SFormula {cup} SFormula (where SFormula is optimal for Formula and some iisin{0,1,...,b}, and SFormula is optimal for Formula and bi) implies that rPD(S) = rPD(SFormula {cup} SFormula ) = rPD(SFormula (i){cup} SFormula (bi)) and therefore SFormula (i){cup} SFormula (bi) is also optimal. Note that if there are multiple optimal solutions for Formula and i (or for Formula and bi), some of which are empty and some of which are not (possible if there are paths of zero length from the root of the clade to some of the taxa), we must ensure that the stored solution SFormula (i) (or SFormula (bi)) is nonempty. Otherwise, we may have that the union of optimal solutions is nonoptimal (specifically, when rPD(SFormula {cup} SFormula ) > 0 and rPD(SFormula (i){cup}SFormula (bi)) = rPD({emptyset}{cup}{emptyset}) = 0).

Consequently, we can fill the entire solutions table one row at a time: because the content of the row for a clade Formula can be completely derived from the content of the rows for its subclades Formula and Formula , we just need to make sure that, whenever we fill the row for a clade, the rows for its subclades have already been filled. This can be achieved by dealing with the clades in a bottom-up order: let the rows in the table be ordered according to a top-down traversal of all clades (as illustrated in Fig. 2); then, fill the rows from the last to the first. Once the entire table has been filled, the solution to problem (1) is available from its top right corner.

Although storing entire solutions in the solutions table is feasible, this is certainly not efficient, as solutions can contain many taxa and therefore require a variable amount of memory and time to be handled. As we will show, instead of storing solutions, it is sufficient to retain their r PD and the way the sub-budget should be partitioned between the two subclades (if there are any). More precisely, let SFormula (b) be the solution found for clade Formula and sub-budget b; instead of storing SFormula (b), we will store two quantities: (a) the maximum r PD achievable in Formula with sub-budget b, which we call {lambda}Formula (b) and is equal to rPDFormula (SFormula (b)); and (b) (only when Formula has two subclades) the sub-budget that solution SFormula (b) allocates to Formula 's left subclade, which we call {iota}Formula (b)—obviously this also determines the sub-budget to allocate to the right subclade, b{iota}Formula (b). Note that this does not mean that the actual expenditures in the subclades need equal {iota}Formula (b) and b{iota}Formula (b), but rather that the left part of SFormula (b) is optimal for {iota}Formula (b), and its right part for b{iota}Formula (b). In Figure 2, each cell in the solutions table shows {lambda}Formula (b) at the top and, when appropriate, in the two bottom corners, {iota}Formula (b) on the left and (for illustrative purposes) b{iota}Formula (b) on the right, which indicates the way the sub-budget should be partitioned between the two subclades.

Note that it is precisely thanks to storing {iota}Formula (b) that solutions need not be memorized; this information (once available) allows us to reconstruct the optimal solution found for any given Formula and b: just visit all the clades below Formula in a top-down fashion always using the {iota}Formula values to find out what part of b should be devoted to each clade; in the end, b will have been broken into all the expenditures to allocate to each taxon in Formula ; a taxon s should be included in the solution only if the expenditure for s is greater or equal to its cost cs (see Appendix, Algorithm 3, reconstruct(Formula ,b), for a formalization of this procedure). For example, imagine that we wish to reconstruct the solution for clade Formula and sub-budget 4 in Figure 2; the two values at the bottom corners of the table entry for Formula and 4 (shaded gray) indicate that we should allocate 2 to Formula and 2 to Formula . Formula only contains taxon C, whose cost is exactly 2 and therefore should be selected. Formula has two subclades, one containing D and the other E; the table entry for Formula and 2 indicates that nothing should be spent in the left subclade (so D should not be selected, as its cost is greater than 0) and that 2 should be assigned to the right subclade (so we do select E, as its cost is exactly 2); therefore, the solution for Formula and 4 is {C,E}.

The solutions table is filled with {lambda}Formula (b) and {iota}Formula (b) values in a way analogous to the one we described above in terms of whole solutions. The clades are visited in a bottom-up order. For the clades only consisting of a terminal branch (leading to, say, taxon s), the {lambda}Formula (b) values are set to 0 for entries with b up to (but not including) cs and to the length of the terminal branch for the remaining entries; the {iota}Formula (b) are left undefined, as they have no meaning for these clades. For example, see the solutions table row for Formula (shaded red) in Figure 2.

When instead we visit a clade Formula composed of a branch (a) and two subclades (Formula and Formula ), we need to consider two cases. First, for all the entries with b smaller than the minimum cost among the taxa in Formula (which we denote by cFormula ), {lambda}Formula (b) is set to 0, as clearly the sub-budget b is not enough to cover the cost of any of the taxa in Formula ; note also that for these entries {iota}Formula (b) can be set to any i = 0,1,...,b, as any of these values will lead to reconstructing the empty solution (for example, see the first two entries in the table row for Formula in Fig. 2). Second, for the remaining entries (those with b ≥ cFormula ), {lambda}Formula (b) is set to the maximum value among ta+{lambda}Formula (0)+{lambda}Formula (b), ta+{lambda}Formula (1)+{lambda}Formula (b–1),..., ta+{lambda}Formula (b)+{lambda}Formula (0) (where ta represents the length of a), as this is the r PD of the best possible combination of complementary solutions for Formula and Formula ; the corresponding {iota}Formula (b) is set to a value of iisin{0,1,...,b} that maximizes {lambda}Formula (i)+{lambda}Formula (bi). For example, suppose we aim to fill the entry for Formula and 4 in Figure 2. Because we are proceeding in a bottom-up fashion, the rows corresponding to Formula and Formula have already been filled, and the {lambda}Formula and {lambda}Formula values are all available. Therefore, {lambda}Formula (4) = max {1.0+{lambda}Formula (0)+{lambda}Formula (4),1.0+ {lambda}Formula (1)+{lambda}Formula (3),...,1.0+{lambda}Formula (4) +{lambda}Formula (0)} = max {5.0,5.0,5.0,2.0,2.0} = 5.0. This corresponds to consideration of combining the first entry shaded in red with the fifth in blue, the second in red with the fourth in blue, and so on; in the end we check which of these combinations has given the largest r PD. The value of {iota}Formula (4) is set accordingly: in this case, there are three equivalent combinations and {iota}Formula (4) could be set to 0, 1, or 2 (leading to two different but equally good solutions).

In summary, the {lambda} and {iota} values are calculated with the following recursions (which assume that Formula is composed of branch a and, possibly, subclades Formula and Formula ):


Formula

The above argmax term indicates the value of i that maximizes the expression on its right; when there are several such values, it indicates any one of them (e.g., chosen randomly) or, in the special case where b ≥ cFormula and {lambda}Formula (i)+{lambda}Formula (bi) = 0 for all i, any value of i such that the resulting {iota}Formula (b) will cause reconstruction of a nonempty solution (this can be achieved by taking either i = cFormula if cFormula ≤ b, or i = bcFormula if cFormula ≤ b). Implicitly storing the empty solution would either be wrong (if ta > 0, any optimal solution is certainly nonempty) or (if ta = 0) might lead to constructing a nonoptimal solution further up in the tree (see discussion above).

The various minimum costs cFormula for all clades should also be derived and stored. For a given Formula , this can be done when we set about filling its row, by either simply copying cs (if Formula is a terminal branch leading to s) or taking the minimum between cFormula and cFormula (if Formula contains subclades Formula and Formula ). In Figure 2, the cFormula values are reported on the right of the solutions table.

Once all the {lambda} and {iota} values have been derived in the solutions table, the solution implicitly found for the top right entry of the table is also a solution to problem (1) and it can be reconstructed using the {iota}Formula values in the way we described above. However, there may be more than one optimal solution to problem (1), all equally good with respect to r PD but possibly involving different overall expenditures. It is not guaranteed that the solution reconstructed as described above is the cheapest among them. Although this is not required by the problem formulation (1), this is clearly a desirable property and easy to achieve: by looking at the solutions found for smaller sub-budgets, we can check if the budget can be reduced without affecting the optimal r PD. This corresponds to scanning the solutions table from its top right entry towards the left, until a reduction in {lambda}Formula X(b) is observed. The solution for the last (i.e., least) b with {lambda}Formula X(b) = {lambda}Formula X(B) is a minimal-cost r PD-optimal solution and can be reconstructed in the usual way. In some cases it may even be of interest to derive all (minimal-cost) optimal solutions, which could be achieved by storing mutiple {iota}Formula (b) values and using all of them in the reconstruction at the end. However, we note that the number of solutions to reconstruct may grow exponentially in the size of the problem.

This concludes the description of our dynamic programming algorithm for maximizing r PD subject to cost constraints. A different description, less verbose and directly convertible into computer code, is given by the pseudocode in Algorithms 1, 2, and 3 in Appendix. Additionally, the dynamic programming algorithm could be modified to solve problem (1) for trees with branches of any length (i.e., where negative lengths are allowed), but for simplicity we do not describe this here.

Regarding the computational complexity of our algorithm, the calculation of a single entry in the solutions table requires O(b) = O(B) time, as all possible ways to split sub-budget b may need to be examined. This must be repeated for each of the (2n–1)(B+1) = O(nB) subproblems (where n = |Formula |), giving a total of O(nB2) operations for filling the solutions table. The reconstruction of an optimal solution for problem (1) from the top right entry only takes O(n) time, as it consists in a top-down traversal of all the 2n–1 = O(n) clades of Formula X, in which each clade can be dealt with in constant time. Therefore, the entire algorithm has time complexity O(nB2). Memory complexity is dominated by the size of the solutions table, and so is O(nB).

Finally, we note that problem (1) could also be formulated as an integer linear programming (ILP) problem (in a way analogous to that of Rodrigues and Gaston, 2002)and solved with standard off-the-shelf techniques. However, there are no guarantees that the running time of these algorithms would be better than exponential in n.


    The Unrooted Case
 Top
 Abstract
 A Dynamic Programming Algorithm...
 The Unrooted Case
 Applications to the Noah's...
 Discussion
 Appendix
 Acknowledgments
 References
 
We now turn to solving problem (1) with PD defined as u PD, which we call the unrooted problem. An example is given in Figure 3. At first glance, an obstacle to its solution seems to be that there is no evident way to break it into smaller problems: even if we solve the unrooted problem on portions of Formula X, this does not tell us much about the solution for the whole tree.


Figure 3
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 3 On the left, an instance of problem (1) (unrooted case). Branch lengths are indicated by the numbers to the side of the branches. Taxon costs are in the boxes next to the leaves (green) and the budget is B = 8. On the right, the corresponding solutions table (partially filled, for clarity). Rows correspond to the clades specified to the left, whereas column headings 0–8 indicate sub-budgets b. Arrows to the right of the table indicate the dependencies amongst the rows; see Appendix for an explanation of the row ordering (and the use of the node highlighted in pink). Orange cells correspond to solutions to the rooted subproblem for clades Figure 3s and sub-budgets Bcs. Dotted arrows and yellow cells show the reconstruction of the solution {A, C, D, E} to the unrooted problem (visited taxa are marked with a green tick or a red cross, depending on whether they should be selected or not, respectively).

 
The key observation here is that a solution to the unrooted problem, if not the empty set, is equal to {s}{cup} R, where R is a solution to the rooted problem with budget Bcs applied to Formula s, which denotes the version of Formula rooted in taxon s. (For example, in Figure 3, {A,C,D,E}, the optimal solution to the unrooted problem, can be written as {A} {cup} {C,D,E}, where {C,D,E} is a solution to the rooted problem on Formula A with budget BcA = 8–1 = 7.)

This allows us to reduce the unrooted problem to a number of related rooted problems: we could iteratively root Formula X in each of its taxa s and calculate a solution Rs to the rooted problem with budget Bcs (or skip s if cs > B); any of the subsets {s}{cup} Rs with the largest u PD (or, equivalently, any of the ones with the largest rPD(Rs)) is a solution to the unrooted problem.

However, this procedure would involve repeating the r PD-maximization algorithm once for every taxon, thus requiring O(n2B2) time. A more efficient approach, which we describe in the Appendix, consists of extending our notion of clade so that it includes all the different taxon-rootings Formula s; as before, we solve the rooted subproblems (with many possible sub-budgets) for all the clades in Formula X. Because this now includes additional clades not present before, we devise a new ordering of the clades so that we can incrementally derive new solutions from the previously calculated ones. In the end, we compare the rooted solutions found for the various Formula s; any of the best ones will provide us with an optimal solution to the unrooted problem.

This more efficient algorithm for the unrooted problem is practically equivalent in computational complexity to the one in the last section. As described in the Appendix, there are now 2(2n–3) clades, and therefore still O(n) rows in the solutions table (and O(B) columns). Derivation of the solution for Formula and b from the other stored solutions and reconstruction of any of these solutions using the {iota}Formula values are achieved in exactly the same way as before and therefore still take O(B) and O(n) time, respectively. Comparing rPD(Rs) for all the rooted solutions Rs in order to find the unrooted solution {s}{cup} Rs also takes just O(n) time. Therefore, this algorithm has time complexity O(nB2) and memory complexity O(nB).


    Applications to the Noah's Ark Problem
 Top
 Abstract
 A Dynamic Programming Algorithm...
 The Unrooted Case
 Applications to the Noah's...
 Discussion
 Appendix
 Acknowledgments
 References
 
In the NAP (Weitzman, 1998) each taxon sisinFormula in a phylogenetic tree Formula survives until some future time with a probability ps that depends on the expenditure that is put into conserving s. The objective is to subdivide a given budget B among the available taxa to maximize the r PD of the surviving taxa or, more precisely, the expected value of this random variable. By increasing the expenditure {gamma}s for taxon s, the probability of survival will also (generally) increase as a function ps({gamma}s). In the original formulation (Weitzman, 1998), ps grows linearly from an initial value as to a maximum probability bs, as the expenditure is increased from 0 to a cost cs. For this shape of ps({gamma}s) (and others) optimal solutions to the NAP are "extreme"; i.e., for every taxon s (with the possible exception of one taxon), either nothing or the whole cs is spent on it (Weitzman, 1998). Following this, Hartmann and Steel (2006) have cast the NAP in a form (not entirely equivalent to the original NAP) analogous to that of problem (1):


Formula 2

(2)
Here, Formula (r PD | S) denotes the expected r PD of the taxa that "survive," where taxa survive independently with a probability of either bs or as, depending on whether sisin S or not. It is easy to see that


Formula 3

(3)
(Hartmann and Steel, 2006), where the sum is over all branches of Formula and Ca denotes the set of taxa below branch a (again, Formula is pictured with its root at the top).

Note that problem (1) is the particular case of problem (2) obtained by setting as = 0 and bs = 1 for all taxa. We will refer to problem (1) as the 0 Formula 1 NAP and to (2) as the as Formula bs NAP. This notation indicates the probabilities of survival without and with conservation (left and right of the arrow) and the conservation costs (above); these may be constants or taxon-dependent (indicated by variables indexed by taxon s).

There is in fact a hierarchy of subproblems of the as Formula bs NAP. The simplest of them is the 0 Formula -> 1 NAP, which can be solved with a greedy algorithm (Steel, 2005; Pardi and Goldman, 2005). There are other greedy-tractable regions in this hierarchy, notably the 0 Formula 1 NAP applied to an ultrametric tree and the 1–qs Formula -> 1–{kappa} qs NAP, where for every taxon s the initial probability of extinction qs can be reduced by a constant factor {kappa} (0 ≤ {kappa} ≤ 1) by paying a constant price c (Hartmann and Steel, 2006).

We will show that other (larger) regions of this hierarchy can be solved with dynamic programming algorithms: our base algorithm for problem (1) already demonstrated this for the 0 Formula 1 NAP, and we now show that the same holds, more generally, for the as Formula 1 NAP. This is the relatively realistic scenario whereby conservation projects with variable costs cs completely ensure survival of the species to which they are applied, which can have different initial risks of extinction (as depends on s). This may be the case when a taxon can be saved by simply saving a few of its individuals, be it in a zoo or in a prophet's ark. An example of as Formula 1 NAP is given in Figure 4: panel (a) gives a tree Formula X for 52 Madagascar lemurs (including all those in the current IUCN Red List of Threatened Species, IUCN, 2006) and putative costs cs of conserving each of them, and panel (b) shows the probabilities of survival without conservation, as, and their effect on this problem, as explained below.


Figure 4
View larger version (211K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Figure 4 Application of our algorithms to the conservation of lemurs in Madagascar. Example for illustrative purposes only (the data are partly concocted). (a) An instance of problem (1) (the Figure 4 1 NAP). A tentative phylogenetic tree of 52 lemurs (species and subspecies) was drawn on the basis of some recent publications (e.g., Yoder, 1997; Yoder et al., 2000; Pastorini et al., 2001, 2002; Roos et al., 2004; Andriaholinirina et al., 2006). The correspondence between taxa and abbreviations is reported in the Appendix. Taxon conservation costs (in the boxes next to the leaves) were estimated from information available from the IUCN Red List (IUCN, 2006) and can be considered as expressed in terms of the underlying resource of limited availability; e.g., as millions of euros. The solution of this instance for a budget B = 20 consists of taking the taxa in the set {Hgr gr, Ala, Iin, Lmi, Pfu el, Cme, Atr, Mmu, Dma}. Highlighted is the tree that spans these taxa. We note that this solution cannot be obtained from the one for B = 19 ({Hgr gr, Ala, Iin, Lmi, Pfu el, Cme, Cma, Mmu, Dma}) through a greedy step (i.e., a simple addition) but needs exchange of one taxon (Cma) for another (Atr). (b) Phylogenetic tree obtained from the one in (a) by applying the transformation described in the text for the as Figure 4 1 NAP. For each taxon s, its cost cs is indicated by the adjacent box and its probability of survival as by the white area in the adjacent circle. The probabilities of survival were derived from the IUCN Red List classifications (IUCN, 2006): taxa classified in risk categories CR, EN, VU, NT, LC were given probabilities 5, 25, 50, 75, and 95%, respectively. Highlighted is the tree that spans the taxa in the solution for the underlying as Figure 4 1 NAP with budget B = 20. Again, this solution cannot be obtained through a greedy step from the one for B = 19 but needs exchange of one taxon (Mra) for other two taxa (Aoc and Mmy). (c) Plot showing the (expected) phylogenetic diversity of the optimal solution for the two NAP instances above as a function of the budget B. The lower and upper graph correspond to the 0 Figure 4 1 NAP and the as Figure 4 1 NAP, respectively. The vertical dashed line corresponds to the budget (20) for the two solutions above, which achieve a r PD and Figure 4(rPD) equal to 58 and 87% of the total tree length (5.44), respectively.

 
The key observation here is that, for any instance of the as Formula 1 NAP, it is possible to transform the input tree Formula into a new tree Formula ' so that the gain in Formula (rPD) obtained by conserving the taxa in S coincides with the r PD of S in Formula ', for any possible subset SsubeqFormula . As a consequence, we can solve any as Formula 1 NAP by solving with our base algorithm the corresponding r PD-maximization problem on the transformed tree Formula '. In Figure 4, for example, the tree in panel (b) is the transformation Formula ' of the tree Formula X in panel (a).

Formally, define Formula ' as the tree obtained from Formula by multiplying each branch length ta by a factor equal to prodsisin Ca(1–as). Then, the gain in Formula (rPD) due to conserving the taxa in any subset S can be simply derived from Equation (3):


Formula

and is equal to rPD(S) calculated on Formula ', as stated above.

It is interesting to reflect on the intuitive meaning of this transformation. The new tree is obtained by multiplying each branch by the probability that, as a result of extinctions, that branch will disappear from the tree connecting the surviving taxa. Therefore, we can think of this as a form of weighting that gives more importance to those parts in the tree that are more likely to get lost. For example, in Figure 4b, most of the internal branches in Formula ' have a length that is very close to 0. This is because these branches have many taxa below them and are therefore unlikely to be lost. Contrast this with the internal branch leading to the two Vva subspecies: this branch is relatively long, as both Vva ru and Vva va are likely to become extinct.

Thanks to this transformation, a subset of taxa that maximizes Formula (rPD) on Formula subject to any given constraint also maximizes r PD on Formula ' subject to that same constraint, and vice versa. If the constraint is simply a limit on the number of taxa, this means that any as Formula -> 1 NAP can be solved by solving the corresponding 0 Formula -> 1 NAP on the transformed tree Formula ', which can be done with a simple greedy algorithm (Steel, 2005; Pardi and Goldman, 2005). This result was also shown by Hartmann and Steel (2007), who independently derived the same branch length rescaling described above. The tractability of the as Formula -> 1 NAP with a greedy algorithm also derives from its being a subproblem of the greedy-tractable 1–qs Formula -> 1–{kappa} qs NAP (Hartmann and Steel, 2006).

More generally, if the constraint is of the cost-budget type, the above means that any as Formula 1 NAP can be reduced to the corresponding 0 Formula 1 NAP on the transformed tree Formula ', which can be solved with our base algorithm. Being able to solve the rather general as Formula 1 NAP is a novel result, and we suspect that the applicability of dynamic programming techniques to the NAP is even wider.

Notice also that, even more generally, any as Formula bs NAP (problem 2) can be reduced to a NAP with the form 0 Formula bs'. Constructing Formula ' as before, a similar proof to that above shows that solving any as Formula bs NAP on Formula is equivalent to solving the corresponding 0 Formula (bsas)/(1 – as) NAP on Formula '. Even though we cannot use this reduction to solve the general as Formula bs NAP (as realistic algorithms to solve the 0 Formula bs NAP are not currently known), this shows that all the difficulty of the general NAP is somehow already present in the 0 Formula bs NAP.

It is important to realize that our dynamic programming algorithm cannot be applied or adapted to the general as Formula bs NAP (or the 0 Formula bs NAP), as these problems do not have optimal substructure (it is not difficult to construct an example where the solution for a clade Formula is not obtainable as the union of solutions for Formula 's subclades). One exception to this is the 0 Formula -> b NAP, which does have optimal sustructure and therefore can be solved with a simple adaptation of our base dynamic programming algorithm. As the 0 Formula -> b NAP can hardly have any practical importance, we do not describe this adaptation here, but this observation allows us to show that, interestingly, the class of (known) greedy-tractable subproblems of the NAP (Hartmann and Steel, 2006) is contained in the one tractable with dynamic programming algorithms: the 0 Formula 1 NAP applied to an ultrametric tree is a subproblem of the 0 Formula 1 NAP tout court, and any 1–qs Formula -> 1–{kappa} qs NAP can be reduced (following our observation in the preceding paragraph) to a corresponding 0 Formula -> 1–{kappa} NAP, also solvable (as just mentioned) with dynamic programming.

Finally, note that problem (2) assumes a rooted definition of PD. Although we are not aware of work on the NAP using u PD instead of r PD (which we call u PD-NAP), it is natural to ask whether the techniques presented here (and in other papers such as Hartmann and Steel, 2006) can also be applied to the u PD-NAP. Using the ideas in this and the previous section, it is possible to solve any as Formula 1 u PD-NAP. Its solution is either empty or equal to {s}{cup} R, where R is a solution to the as Formula 1 NAP on Formula s with budget Bcs (which we can solve as we just showed). Therefore, the problem can be solved by rooting the tree in each taxon in turn; for each rooting, transform Formula s in the way described before (this transformation is dependent on the position of the root, so we will get a different tree each time), and solve the resulting 0 Formula 1 NAP with budget Bcs. Denoting by Ri the solutions obtained when rooting in each taxon si, then a solution to the as Formula 1 u PD-NAP is among {s1}{cup} R1,{s2}{cup} R2,...,{sn}{cup} Rn and can be found by simply comparing their Formula (uPD). Note that the time complexity of this algorithm will now be O(n2B2), as the 0 Formula 1 NAP will have to be solved on n different trees.


    Discussion
 Top
 Abstract
 A Dynamic Programming Algorithm...
 The Unrooted Case
 Applications to the Noah's...
 Discussion
 Appendix
 Acknowledgments
 References
 
Firstly, we have presented a new formalization of the problem of selecting taxa to maximize the retained phylogenetic diversity. The basic advantage, compared to past formalizations (Steel, 2005; Pardi and Goldman, 2005), is the possibility of taking into account the different resources required by the different taxa, which we assume to be quantifiable into taxon-specific costs.

Secondly, we have given algorithms that solve this problem for two definitions of PD. These algorithms run in O(nB2) time and use O(nB) memory, where n is the number of taxa to be chosen from and B is the budget expressed as the number of units of a relevant resource (e.g., currency). The fact that the computational complexity depends on B may seem problematic: for example, if the underlying resource is money, the budget could be a very large number (of the order of the millions, if measured in common currencies) and O(nB2)-time algorithms would be unusable in practice. To avoid this, B and all costs should be preprocessed and expressed as multiples of a large unit (such as their greatest common divisor, efficiently found with a generalization of Euclid's algorithm; Cormen et al., 2001). For example, in the case of conservation projects with budgets of the order of millions of euros, we may express everything in multiples of 10,000. This is probably precise enough to satisfy the accountants, makes B of the order of hundreds and permits the algorithms to run quickly. In order to check their efficiency, we implemented our algorithms (C++ code available via http://www.ebi.ac.uk/goldman/rats) and found that for values of B of the order of the thousands, our program still only takes few seconds per clade (1.7-GHz Pentium processor with 512 MB of RAM), and up to few minutes per clade for B = 100,000. Note that these are the times needed to process one clade (i.e., fill its row in the solution table) and that they do not depend on the size of clades (because each nontrivial clade will require the same O(B2) operations to be processed). The total running time is obtained by multiplying the clade-processing time by the number of clades (2n–1 in the rooted case) and therefore depends on the number of taxa n but is independent of the tree shape.

Thirdly, we have shown that many cases of another formalization of taxon selection, the Noah's Ark Problem (NAP), can be transformed into instances of the problems solvable with our algorithms. We are currently working on the application of dynamic programming techniques to a form of the NAP that allows for more general relationships between conservation expenditure and probability of survival.

Although we realize that optimization problems such as the one formulated here (or even the NAP) are mainly of theoretical interest currently, it is still meaningful to ask how realistic and complete they are in including factors relevant to the selection of taxa. Many factors other than PD can be incorporated into problem (1), and the NAP, by adding to the length of each terminal branch a term quantifying the taxon's importance in relation to those factors (Steel, 2005; Pardi and Goldman, 2005)—this term coincides with what Weitzman (1998) calls the species' utility. Whereas for comparative genomics this approach seems capable of dealing with most selection criteria (Pardi and Goldman, 2005), for biodiversity conservation one of the most important factors, the actual probability of extinction, is much better dealt with by the NAP.

An important question that may be asked regarding the realism of our problem (and the NAP) is whether PD is a good guiding criterion for taxon selection. For comparative genomics, answering this question will involve investigating the relationship between uPD(S) and the statistical power (of tests on evolutionary processes) that results from comparing the sequences in S. Along the lines already considered by Eddy (2005) and McAuliffe et al. (2005), it will be interesting to tackle questions such as: Under which circumstances is PD maximization not the best way to ensure high statistical power? Do these circumstances ever arise in practice? Is it possible to define an alternative measure of sequencing worth that more closely reflects power? Are the answers to these questions dependent on which statistical test is going to be carried out?

As for conservation biology, we note that PD is not the only measure of conservation worth that has been proposed (reviewed by Crozier, 1997). A widespread characteristic of proposed measures is that, as May recommended in his seminal note (May, 1990), they give central importance to taxonomic (phylogenetic) relationships. Because they try to formalize the intuitive notion of diversity on a phylogenetic tree, many of them are mathematically related. Genetic diversity (GD) (Crozier, 1992, 1997) is defined as the probability that the set of taxa preserves more than one allele per site. A branch a in the tree is labeled with pa, the probability that an allele changes in the transition from one end of branch a to the other. Then GD(S) = 1–proda(1–pa), where the product is over all branches a that are preserved (i.e., all branches in the smallest unrooted tree connecting all taxa in S). GD turns out to be strictly related to u PD: if we assume that pa and the branch length ta are simply related through pa = 1–e{kappa} ta, which is typical of sites where the number of alleles is so high that it is practically impossible to change back to a previously held state, then GD(S) = 1–prodae{kappa} ta = 1–e{kappa}{sum}ata = 1–e{kappa}· uPD(S). Therefore, as GD grows monotonically with uPD, maximizing GD is equivalent to maximizing uPD. This result was suggested without proof by Crozier (1997).

Even species richnessGaston and Spicer, 1998 (the number of different species), which is probably still the most common measure of biodiversity for conservation, is mathematically related to PD: it is simply equivalent to r PD in an ultrametric star tree. Therefore, all the results presented here can be extended to the maximization of (expected) species richness (but the optimization problems are greatly simplified and better algorithms exist). The number of taxa conserved can also be used as a secondary criterion to discriminate among equally good choices of taxa (i.e., those with equal PD and total cost) and this (or even other criteria) can be easily incorporated in our algorithms by suitably setting {iota}Formula (b) when multiple choices for this value produce the same r PD.

Furthermore, measures of diversity mathematically equivalent to PD are likely to arise in many other fields—e.g., ecology (Petchey and Gaston, 2002a, 2002b) and social sciences (Nehring and Puppe, 2002, 2003)—whenever a tree is a good way to represent relationships (often nonevolutionary) among the objects of study. For example, a tree ("dendrogram") is often the output of data clustering techniques (Everitt et al., 2001; Eisen et al., 1998). Our results may have even wider applicability than we have suggested here.

An important shortcoming of the NAP is that species are assumed to survive or become extinct independently from one another, whereas in reality strong interdependencies may exist (for example, between predators and prey; van der Heide et al., 2005; Witting and Loeschcke, 1995; Witting et al., 2000). These dependencies can in principle be formalized, but the resulting optimization problem is likely to be a very difficult one. In particular, we doubt that dynamic programming approches such as ours will be effective: in order for a problem to have optimal substructure, a degree of independence between its parts is needed. Heuristic approaches such as hill-climbing or evolutionary algorithms may be preferable.

Note that the assumption of independence may be justified for exsitu conservation—removal and protection of a small number of individuals from their environment does not affect the survival of other taxa—but clearly not for in situ conservation (van der Heide et al., 2005). In particular, when protection is applied to a geographical area rather than to a species or population, survival probabilities are raised simultaneously for the entire group of taxa that lives in that area. Interestingly, this leads quite naturally to a generalization of the NAP in which selection is applied to a number of potential nature reserves, each defining a set of changes in survival probablities. Leaving a more precise formulation to the Appendix, we may call this the Nature Reserve Problem (NRP). When each reserve only contributes towards the survival of a single taxon, the NRP reduces to the NAP.

The problem of selecting reserves with the aim of preserving biodiversity is a well-established research topic in biodiversity conservation (e.g.,] Diamond, 1975; Higgs and Usher, 1980; Pressey et al., 1993; Dobson et al., 1997; Ando et al., 1998; Howard et al., 1998; Margules and Pressey, 2000; Rodrigues et al., 2004), and recently there has been a lot of interest in defining and solving (often heuristically) optimization problems for reserve selection (e.g., Margules et al., 1988; Cocks and Baird, 1989; Underhill, 1994; Camm et al., 1996; Church et al., 1996; Csuti et al., 1997; Polasky et al., 2000; Cabeza and Moilanen, 2001; Rodrigues and Gaston, 2002; Önal and Briers, 2003). Many (if not most) of the past formalizations would simply become special cases of the NRP we propose: for example, the PD-maximization problem of Rodrigues and Gaston (2002) could be seen as the 0 Formula -> 0/1 NRP, and the expected species richness maximization problem of Polasky et al. (2000) as the 0 Formula -> brs NRP applied to an ultrametric star tree (see Appendix for the definition of the notation used here). The NRP would thus unify the (arguably most successful) approaches from economics and biology to the problem of biodiversity conservation, thus becoming a fertile meeting ground for the exchange of new ideas between these two disciplines.


    Appendix
 Top
 Abstract
 A Dynamic Programming Algorithm...
 The Unrooted Case
 Applications to the Noah's...
 Discussion
 Appendix
 Acknowledgments
 References
 
The Algorithm for the Unrooted Case
Here, we define a clade as any rooted subtree Formula of Formula X consisting of (a) a branch a, with one of its ends being the root of Formula , and (b) everything else in Formula X that lies on the other (i.e., nonroot) side of a. This definition includes all the clades in the previous sense, but whereas before each branch identified one single clade, now for each branch there are two clades, each rooted by one of its ends. For example, the branch of length 3.0 in Figure 3 is at the "top" of both clades (A,B) and (C,(D,E)). Assuming that Formula X is originally unrooted (the position of any root has no relevance for u PD), there are exactly 2(2n–3) clades in Formula X. In particular, note that the Formula s, the versions of Formula rooted in each taxon s, are now clades.

Although the collection of clades is now nonhierarchical, the fundamental properties of a single clade have not changed: a clade is still a rooted tree with one branch departing from its root and with its leaves being a subset of the leaves of Formula X. Therefore, the rooted subproblems defined before for all clades Formula and sub-budgets b ≤ B remain well defined. We are particularly interested in solving the subproblems for the taxon-rooted clades Formula s and sub-budgets Bcs because, as previously observed, some of their solutions Rs provide us with solutions {s}{cup} Rs to the unrooted problem.

All the rooted subproblems will again be implicitly solved by incrementally deriving {lambda}Formula (b) and {iota}Formula (b) for all Formula and b. As we showed before, the calculation of these values is either straightforward or directly obtainable from the corresponding values for Formula 's subclades. It is therefore necessary to tackle the clades in an order that guarantees that subclades are met before their "superclades." Whereas before this was trivially satisfied by a bottom-up traversal of all clades, now a slightly more complex approach is needed. First, a node in Formula X is arbitrarily chosen as top node of the tree (different choices of this node, which can be a leaf, lead to different orderings of the clades, but all produce the same final results). This determines a way to picture the tree (imagine it redrawn with the top node uppermost and all branches descending from there) and we can then classify clades into downward clades—those consisting of a branch and everything below it—and upward clades—the remaining ones. For example, in Figure 3 the top node is set to the position highlighted in pink; as a result, (A,B) is a downward clade and (C,(D,E)) an upward clade.

The {lambda} and {iota} values are calculated for downward clades first, going in the usual bottom-up order. In Figure 3, where again we imagine that all the results are stored in a solutions table, this corresponds to filling the bottom half of the table, starting from the bottom row and going upwards. The results for the upward clades are then derived in a top-down fashion: more precisely, we can imagine that all the branches in Formula X are visited in a top-down order and each time a branch is visited, the {lambda} and {iota} values for the corresponding upward clade are computed. In Figure 3, upward clades are visited in the following order: 1: (C,(D,E)); 2: (B,(C,(D,E))); 3: (A,(C,(D,E))); 4: ((A,B),(D,E)); 5: ((A,B),C); 6: (((A,B),C),E); 7: (((A,B),C),D), which corresponds to filling the top half of the solutions table from bottom to top. This ordering of the clades guarantees that whenever we derive solutions for a clade Formula containing subclades Formula and Formula , the solutions for Formula and Formula have already been calculated.

Once all the {lambda} and {iota} values have been calculated, we turn our attention to the values {lambda}Formula s(Bcs) for all taxa s. By definition, these are equal to rPD(Rs), where Rs is a solution to the rooted problem with budget Bcs applied to Formula s, and therefore also equal to the u PD of the candidate solutions {s}{cup} Rs to the unrooted problem. Clearly, the taxa s that maximize {lambda}Formula s(Bcs) are those contained in an optimal solution to the unrooted problem (in Figure 3, all of them except B). If we pick any one of these taxa s and reconstruct Rs by using the {iota}Formula values starting from {iota}Formula s(Bcs)—as described for the rooted case or, equivalently, with a call to reconstruct(Formula s,Bcs)—we therefore also obtain an optimal solution {s}{cup} Rs to the unrooted problem. For example, if in Figure 3 we pick taxon C, following the {iota}Formula values (indicated in the figure by dotted arrows) leads to rooted solution {A,D,E} and therefore {A,C,D,E} is a solution to the unrooted problem.

Note that whereas in this example {A,C,D,E} coincides with the set of taxa that maximize {lambda}Formula s(Bcs), in general the latter will not necessarily coincide with an optimal solution but rather with the union of all optimal solutions to the unrooted problem.

This concludes the description of our algorithm for maximizing u PD subject to cost constraints. Again, a more concise description is given by the pseudocode in algorithms 1–5: a solution to the unrooted problem is simply obtained by a call to double traversal(Formula ). The cheapest optimal solution can be obtained in a similar way to that described above: just try to reduce the budget below Bcs for all taxa s that maximize {lambda}Formula s(Bcs); one of the taxa that lead to the largest reduction should be used as starting point for the reconstruction of the minimal-cost u PD-optimal set of taxa.

Pseudocode
Problem (1) with the rooted definition for PD is solved by a call to bottom-up(Formula X), which fills up the solutions table, followed by a call to reconstruct(Formula X,B), which returns an optimal solution (not necessarily of minimum cost, but this can easily be implemented).

Problem (1) with the unrooted definition for PD is solved by a call to double traversal(Formula ) (where rev(Formula ) denotes the clade rooted in the same branch as Formula but oriented towards the opposite side).

A simple improvement to these algorithms can be obtained by noting that the rows of the solutions table do not always need to be filled up until their last column. If a clade Formula is such that the total sum of the costs for its taxa, which we may denote by CFormula , is smaller than B, then its {lambda}Formula (b) and {iota}Formula (b) values need only be calculated for b ≤ CFormula : the solutions for b > CFormula are necessarily the same as the one for b = CFormula and simply consist of taking all taxa in Formula . Procedures calculate(Formula ) and reconstruct(Formula ,b) can be simply modified accordingly. This results in some saving of running time, although the time complexity remains O(nB2).


Figure 5
View larger version (4K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 


Figure 6
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 


Figure 7
View larger version (6K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 


Figure 8
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 


Figure 9
View larger version (4K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
 
The Nature Reserve Problem
Imagine having, in addition to the species in Formula , a set Formula of potential reserve sites. A pair (r,s), with risinFormula and sisinFormula , identifies the (possibly nonexistent) population of species s in site r. Imagine that we have control over the matrix (prs) of survival probabilities of the populations (r,s) (where by "survival" we mean existence at some specified future time). In analogy to the Noah's Ark Problem, these probabilities can be chosen from two specified values: we have two matrices of probabilities (ars), (brs), and prs will be set to brs or ars depending on whether site r is conserved or not, respectively. Note that we may have ars > brs, when for example species s may benefit from the human exploitation of site r. For each risinFormula , we also have a measure cr of the cost of conserving site r; i.e., of including it in a nature reserve. The objective is to construct a set of reserves RsubeqFormula , with {sum}risin Rcr ≤ B, that maximizes the expected PD resulting from conservation of the sites in R.

The Nature Reserve Problem, which we may write ars Formula brs NRP, is a generalization of many problems defined in the past. In addition to the ones already mentioned, we note that the Budgeted Maximum Coverage Problem (Khuller et al., 1999) coincides with the 0 Formula 0/1 NRP applied to a star tree—where by writing 0/1 on the right of the arrow we mean that each survival probability brs is constrained to equal 0 or 1.

Taxa in Figure 4
Species abbreviations are Dma: Daubentonia madagascariensis; Mmy: Microcebus myoxinus; Mru: Microcebus rufus; Mra: Microcebus ravelobensis; Mmu: Microcebus murinus; Mco: Mirza coquereli; Atr: Allocebus trichotis; Cma: Cheirogaleus major; Cme: Cheirogaleus medius; Pfu el: Phaner furcifer electromontis; Pfu pl: Phaner furcifer pallescens; Pfu pr: Phaner furcifer parienti; Pfu fu: Phaner furcifer furcifer; Lmu: Lepilemur mustelinus; Ldo: Lepilemur dorsalis; Lse: Lepilemur septentrionalis; Led: Lepilemur edwardsi; Lmi: Lepilemur microdon; Lru: Lepilemur ruficaudatus; Lle: Lepilemur leucopus; Pta: Propithecus tattersalli; Pco: Propithecus coquereli; Pve ve: Propithecus verreauxi verreauxi; Pve de: Propithecus verreauxi deckeni; Pve co: Propithecus verreauxi coronatus; Pdi: Propithecus diadema; Ppe: Propithecus perrieri; Pca: Propithecus candidus; Ped: Propithecus edwardsi; Iin: Indri indri; Ala: Avahi laniger; Aoc: Avahi occidentalis; Acl: Avahi cleesei; Vva va: Varecia variegata variegata; Vva ru: Varecia variegata rubra; Hgr al: Hapalemur griseus alaotrensis; Hgr oc: Hapalemur griseus occidentalis; Hgr gr: Hapalemur griseus griseus; Hau: Hapalemur aureus; Hsi: Hapalemur simus; Lca: Lemur catta; Ema ma: Eulemur macaco macaco; Ema fl: Eulemur macaco flavifrons; Eco: Eulemur coronatus; Eru: Eulemur rubriventer; Emo: Eulemur mongoz; Efu co: Eulemur fulvus collaris; Efu ac: Eulemur fulvus albocollaris; Efu ru: Eulemur fulvus rufus; Efu fu: Eulemur fulvus fulvus; Efu af: Eulemur fulvus albifrons; Efu sa: Eulemur fulvus sanfordi. The phylogenetic tree used in Figure 4a and the species' conservation costs and survival probabilities are available from http://www.ebi.ac.uk/goldman/ rats.


    Acknowledgments
 Top
 Abstract
 A Dynamic Programming Algorithm...
 The Unrooted Case
 Applications to the Noah's...
 Discussion
 Appendix
 Acknowledgments
 References
 
We thank Klaas Hartmann, David MacKay, and two anonymous reviewers for their many helpful comments. F.P. is a member of St Catharine's College, University of Cambridge.


    References
 Top
 Abstract
 A Dynamic Programming Algorithm...
 The Unrooted Case
 Applications to the Noah's...
 Discussion
 Appendix
 Acknowledgments
 References
 

    Ando A., Camm J., Polasky S., Solow A. Species distributions, land values, and efficient conservation. Science (1998) 279:2126–2128.[Abstract/Free Full Text]

    Andriaholinirina N., Fausser J., Roos C., Zinner D., Thalmann U., Rabarivola C., Ravoarimanana I., Ganzhorn J., Meier B., Hilgartner R., Walter L., Zaramody A., Langer C., Hahn T., Zimmermann E., Radespiel U., Craul M., Tomiuk J., Tattersall I., Rumpler Y. Molecular phylogeny and taxonomic revision of the sportive lemurs (Lepilemur, Primates). BMC Evol. Biol. (2006) 6:17.[CrossRef][Medline]

    Barker G. Phylogenetic diversity: A quantitative framework for measurement of priority and achievement in biodiversity conservation. Biol. Linn. Soci. (2002) 76:165–194.[CrossRef]

    Cabeza M., Moilanen A. Design of reserve networks and the persistence of biodiversity. Trends Ecol. Evol. (2001) 16:242–248.[CrossRef][Medline]

    Camm J., Polasky S., Solow A., Csuti B. A note on optimal algorithms for reserve site selection. Biol. Conserv. (1996) 78:353–355.[CrossRef]

    Church R., Stoms D., Davis F. Reserve selection as a maximal covering location problem. Biol. Conserv. (1996) 76:105–112.[CrossRef]

    Cocks K., Baird I. Using mathematical programming to address the multiple reserve selection problem: An example from the Eyre Peninsula, South Australia. Biol. Conserv. (1989) 49:113–130.[CrossRef]

    Cormen T., Leiserson C., Rivest R., Stein C. Introduction to algorithms (2001) 2nd edition. Cambridge, Massachusetts: MIT Press.

    Crozier R. Genetic diversity and the agony of choice. Biol. Conserv. (1992) 61:11–15.[CrossRef]

    Crozier R. Preserving the information content of species: Genetic diversity, phylogeny, and conservation worth. Annu. Rev. Ecol. Syst. (1997) 28:243–268.[CrossRef][Web of Science]

    Crozier R., Agapow P., Dunnett L. Conceptual issues in phylogeny and conservation: A reply to Faith and Baker. Evol. Bioinformatics Online (2006) 2:197–199.

    Csuti B., Polasky S., Williams P., Pressey R., Camm J., Kershaw M., Kiester A., Downs B., Hamilton R., Huso M., Sahr K. A comparison of reserve selection algorithms using data on terrestrial vertebrates in Oregon. Biol. Conserv. (1997) 80:83–97.[CrossRef]

    Diamond J. The island dilemma: lessons of modern biogeographic studies for the design of natural reserves. Biol. Conserv. (1975) 7:129–146.[CrossRef]

    Dobson A., Rodriguez J., Roberts W., Wilcove D. Geographic distribution of endangered species in the United States. Science (1997) 275:550–553.[Abstract/Free Full Text]

    Eddy S. A model of the statistical power of comparative genome sequence analysis. PLoS Biol. (2005) 3:e10.[CrossRef][Medline]

    Eisen M., Spellman P., Brown P., Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. USA (1998) 95:14863–14868.[Abstract/Free Full Text]

    Everitt B., Landau S., Leese M. Cluster analysis (2001) 4th edition. London: Arnold.

    Faith D. Conservation evaluation and phylogenetic diversity. Biol. Conserv. (1992) 61:1–10.[CrossRef]

    Faith D., Baker A. Phylogenetic diversity (PD) and biodiversity conservation: Some bioinformatics challenges. Evol. Bioinformatics Online (2006) 2:70–77.

    Forest F., Grenyer R., Rouget M., Davies T., Cowling R., Faith D., Balmford A., Manning J., Proches S., van der Bank M., Reeves G., Hedderson T., Savolainen V. Preserving the evolutionary potential of floras in biodiversity hotspots. Nature (2007) 445:757–760.[CrossRef][Medline]

    Garey M., Johnson D. Computers and intractability: A guide to the theory of NP-completeness (1979) New York: W.H. Freeman and Co.

    Gaston K., Spicer J. Biodiversity: An introduction (1998) Oxford, UK: Blackwell Science.

    Hartmann K., Steel M. Maximizing phylogenetic diversity in biodiversity conservation: Greedy solutions to the Noah's Ark Problem. Syst. Biol. (2006) 55:644–651.[Abstract/Free Full Text]

    Hartmann K., Steel M. Phylogenetic diversity: From combinatorics to ecology. In: Reconstructing evolution: New mathematical and computational advances—Gascuel O., Steel M., eds. (2007) Oxford University Press.

    Higgs A., Usher M. Should nature reserves be large or small? Nature (1980) 285:568–569.[CrossRef][Web of Science]

    Howard P., Viskanic P., Davenport T., Kigenyi F., Baltzer M., Dickinson C., Lwanga J., Matthews R., Balmford A. Complementarity and the use of indicator groups for reserve selection in Uganda. Nature (1998) 394:472–475.[CrossRef][Web of Science]

    IUCN. 2006 IUCN Red List of Threatened Species (2006) http://www.iucnredlist.org(accessed 30 December 2006).

    Khuller S., Moss A., Naor J. The budgeted maximum coverage problem. Inform. Process. Lett. (1999) 70:39–45.[CrossRef]

    Mace G., Gittleman J., Purvis A. Preserving the Tree of Life. Science (2003) 300:1707.[Abstract/Free Full Text]

    Margules C., Nicholls A., Pressey R. Selecting networks of reserves to maximize biological diversity. Biol. Conserv. (1988) 43:63–76.[CrossRef]

    Margules C., Pressey R. Systematic conservation planning. Nature (2000) 405:243–253.[CrossRef][Medline]

    Margulies E., Vinson J., Miller W., Ja–e D., Lindblad-Toh K., Chang J., Green E., Lander E., Mullikin J., Clamp M. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Nat. Acad. Sci. USA (2005) 102:4795–4800.[Abstract/Free Full Text]

    May R. Taxonomy as destiny. Nature (1990) 347:129–130.[CrossRef][Web of Science]

    McAuliffe J., Jordan M., Pachter L. Subtree power analysis and species selection for comparative genomics. Proc. Nat. Acad. Sci. USA (2005) 102:7900–7905.[Abstract/Free Full Text]

    Minh B., Klaere S., von Haeseler A. Phylogenetic diversity within seconds. Syst. Biol. (2006) 55:769–773.[Abstract/Free Full Text]

    Moritz C., Faith D. Comparative phylogeography and the identification of genetically divergent areas for conservation. Mol. Ecol. (1998) 7:419–429.[CrossRef]

    Nee S., May R. Extinction and the loss of evolutionary history. Science (1997) 278:692–694.[Abstract/Free Full Text]

    Nehring K., Puppe C. A theory of diversity. Econometrica (2002) 70:1155–1198.[CrossRef][Web of Science]

    Nehring K., Puppe C. Diversity and dissimilarity in lines and hierarchies. Math. Social Sci. (2003) 45:167–183.[CrossRef]

    Önal H., Briers R. Selection of a minimum-boundary reserve network using integer programming. Proc. R. Soc. B Biol. Sci. (2003) 270:1487–1491.[CrossRef][Medline]

    Pardi F., Goldman N. Species choice for comparative genomics: No need for cooperation. PLoS Genetics (2005) 1:e71.[CrossRef]

    Pastorini J., Forstner M., Martin R. Phylogenetic relationships among Lemuridae (Primates): Evidence from mtDNA. J. Hum. Evol. (2002) 43:463–478.[Web of Science][Medline]

    Pastorini J., Martin R., Ehresmann P., Zimmermann E., Forstner M. Molecular phylogeny of the lemur family Cheirogaleidae (Primates) based on mitochondrial DNA sequences. Mol. Phylogenet. Evol. (2001) 19:45–56.[CrossRef][Web of Science][Medline]

    Petchey O., Gaston K. Extinction and the loss of functional diversity. Proc. R. Soc. B Biol. Sci. (2002a) 269:1721–1727.[CrossRef][Medline]

    Petchey O., Gaston K. Functional diversity (FD), species richness and community composition. Ecol. Lett. (2002b) 5:402–411.[CrossRef][Web of Science]

    Polasky S., Camm J., Solow A., Csuti B., White D., Ding R. Choosing reserve networks with incomplete species information. Biol. Conserva. (2000) 94:1–10.[CrossRef]

    Pressey R., Humphries C., Margules C., Vane-Wright R., Williams P. Beyond opportunism: Key principles for systematic reserve selection. Trends Ecol. Evol. (1993) 8:124–128.[CrossRef]

    Reist-Marti S., Simianer H., Gibson J., Hanotte O., Rege J. Weitzman's approach and conservation of breed diversity: An application to African cattle breeds. Conserv. Biol. (2003) 17:1299–1311.[CrossRef][Web of Science]

    Rodrigues A., Andelman S., Bakarr M., Boitani L., Brooks T., Cowling R., Fishpool L., da Fonseca G., Gaston K., Hoffmann M., Long J. S., Marquet P. A., Pilgrim J. D., Pressey R. L., Schipper J., Sechrest W., Stuart S. N., Underhill L. G., Waller R. W., Watts M. E. J., Yan X. Effectiveness of the global protected area network in representing species diversity. Nature (2004) 428:640–643.[CrossRef][Medline]

    Rodrigues A., Gaston K. Maximizing phylogenetic diversity in the selection of networks of conservation areas. Biol. Conserv. (2002) 105:103–111.[CrossRef]

    Roos C., Schmitz J., Zischler H. Primate jumping genes elucidate strepsirrhine phylogeny. Proc. Nat. Acad. Sci. USA (2004) 101:10650–10654.[Abstract/Free Full Text]

    Simianer H., Marti S., Gibson J., Hanotte O., Rege J. An approach to the optimal allocation of conservation funds to minimize loss of genetic diversity between livestock breeds. Ecol. Econ. (2003) 45:377–392.[CrossRef]

    Steel M. Phylogenetic diversity and the greedy algorithm. Syst. Biol. (2005) 54:527–529.[Abstract/Free Full Text]

    Thomas J., Touchman J., Blakesley R., Bouffard G., Beckstrom-Sternberg S., Margulies E., Blanchette M., Siepel A., Thomas P., McDowell J., et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature (2003) 424:788–793.[CrossRef][Medline]

    Underhill L. Optimal and suboptimal reserve selection algorithms. Biol. Conserv. (1994) 70:85–87.[CrossRef]

    van der Heide C., van den Bergh J., van Ierland E. Extending Weitzman's economic ranking of biodiversity protection: Combining ecological and genetic considerations. Ecol. Econ. (2005) 55:218–223.[CrossRef]

    Vane-Wright R., Humphries C., Williams P. What to protect?—Systematics and the agony of choice. Biol. Conserv. (1991) 55:235–254.[CrossRef]

    Weitzman M. On diversity. Q. J. Econ. (1992) 107:363–405.[CrossRef][Web of Science]

    Weitzman M. What to preserve? An application of diversity theory to crane conservation. Q. J. Econ. (1993) 108:157–183.[CrossRef][Web of Science]

    Weitzman M. The Noah's Ark Problem. Econometrica (1998) 66:1279–1298.[CrossRef][Web of Science]

    Witting L., Loeschcke V. Biodiversity conservation: Reserve optimization or loss minimization? Trends Ecol. Evol. (1993) 8:417.[CrossRef]

    Witting L., Loeschcke V. The optimization of biodiversity conservation. Biol. Conserv. (1995) 71:205–207.[CrossRef]

    Witting L., Tomiuk J., Loeschcke V. Modelling the optimal conservation of interacting species. Ecol. Model. (2000) 125:123–143.[CrossRef][Web of Science]

    Yoder A. Back to the future: A synthesis of strepsirrhine systematics. Evol. Anthropol. Issues News Rev. (1997) 6:11–22.[CrossRef]

    Yoder A., Rasoloarison R., Goodman S., Irwin J., Atsalis S., Ravosa M., Ganzhorn J. Remarkable species diversity in Malagasy mouse lemurs (primates, Microcebus). Proc. Nat. Acad. Sci. USA (2000) 97:11325–11330.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
B. Q. Minh, S. Klaere, and A. von Haeseler
Taxon Selection under Split Diversity
Syst Biol, December 1, 2009; 58(6): 586 - 594.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (7)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Pardi, F.
Right arrow Articles by Goldman, N.
Right arrow Search for Related Content
PubMed
Right arrow Articles by Pardi, F.
Right arrow Articles by Goldman, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?