© 2007 Society of Systematic Biologists
Reconstructing Evolution: New Mathematical and Computational Advances—Olivier Gascuel and Mike Steel (editors). 2007. Oxford University Press, Oxford. xxix + 318 pp. ISBN 978-0-19-920822-7 (ISBN-10 0-19-920822-0). £39.50, $80.00 (hardback).
Department of Parasitology (SWEPAR), National Veterinary Institute and Swedish University of Agricultural Sciences, 751 89 Uppsala, Sweden; E-mail: David.Morrison{at}bvf.slu.se
Mathematicians are a strange breed, as any biologist can tell you: not only do they operate in esoteric ways, they also communicate in an arcane language. Now, science is usually a quantitative subject, and so biologists often have to apply mathematical techniques in order to process their data. It is not uncommon for this to be a problematic exercise, because people become biologists in order to cuddle koalas, not to perform arithmetic calisthenics; and yet, somehow they have to understand what the mathematicians are saying.
This situation was bad enough when all that biologists really needed was statistics, as biologists have frequently been accused of using statistics as a black box, with little real understanding of what they are doing, or even why. However, it has become much more problematic with the development of bioinformatics, as this has considerably extended the role that mathematicians can play in biological work. Indeed, books on bioinformatics are now almost as common as books on statistics. The question is: are any of them readable by biologists?
Before we can answer this question, we need to consider modern mathematics and mathematicians in a bit more detail, because mathematicians are not always in agreement even amongst themselves as to how mathematics should best be performed and/or presented for publication. More to the point, there seems to be an enormous gap between what mathematicians do and what biologists need them to do.
Mathematics is usually seen as a formal subject, in the sense that ideas are presented rigorously as propositions, theorems, proofs, remarks, and lemmas. (As a botanist, I keep looking for the paleas and glumes.) Indeed, modern mathematicians have been charged with taking David Hilbert's vision of complete formalism to an unnecessary and unproductive extreme (e.g., Byers, 2007). Perhaps this charge is less true of applied mathematics (e.g., bioinformatics) than it is of pure mathematics, but this whole approach is surely a poor way of communicating with nonmathematicians. The two biggest limitations of mathematical communication seem to be the focus on algorithms and the lack of diagrams.
Whenever I ask a mathematician to explain an analysis he usually starts describing the algorithm, but this is frequently unhelpful. What I need, instead, is to understand the rationale for that particular approach to the analysis. For example, consider the single most common form of data analysis in biology: calculation of the arithmetic mean. How many biologists actually understand this, even though they can all do it? The usual algorithm for the calculation indicates nothing at all about the rationale for using the mean as an estimate of the central location or average, as it is simply a convenient "machine formula" (unlike the median or the mode, both of which are intuitively obvious from their usual algorithms). The original mathematical justification for the arithmetic average can be expressed as a formula (from which the machine formula is derived), but some descriptive words would be even more helpful (something, perhaps, about finding the point that minimizes the "average" squared distance to the data).
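The "descriptive words" in parentheses can be written out as a short derivation. This is only a sketch, and the notation is assumed here rather than taken from the review or its references: the data are $x_1, \dots, x_n$, $m$ is a candidate central location, and $S(m)$ is the average squared distance from $m$ to the data.

```latex
S(m) = \frac{1}{n}\sum_{i=1}^{n} (x_i - m)^2,
\qquad
\frac{dS}{dm} = -\frac{2}{n}\sum_{i=1}^{n} (x_i - m) = 0
\;\Longrightarrow\;
m = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x},
\qquad
\frac{d^2 S}{dm^2} = 2 > 0 .
```

The final expression for $m$ is the familiar machine formula, and the positive second derivative confirms that it is a minimum rather than a maximum.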
Yet more helpful would be a diagram (perhaps one showing that the mean is the balance point on a see-saw). Not only is a picture worth a thousand words, it is often worth a dozen formulae as well. For instance, Ronald Fisher (who made more fundamental and practical contributions to modern statistics than anyone else) had trouble publishing his work, because he thought geometrically but was expected to explain his work algebraically (Box, 1978)—geometry is almost always best explained with a diagram. There have been many pleas for the greater use of diagrams in mathematics (e.g., Brown, 1999), but these seem to have fallen on a large number of deaf ears.
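For readers who want the see-saw picture in numbers, the following is a minimal sketch of the balance-point idea. The data values and the helper name `signed_distances` are made up for illustration; the only claim being checked is that signed distances from the mean cancel, whereas those from any other pivot do not.

```python
from statistics import mean


def signed_distances(data, pivot):
    """Signed distance of each observation from the pivot.

    With one unit of weight per observation, these are the turning
    forces on a see-saw pivoted at `pivot`.
    """
    return [x - pivot for x in data]


# Hypothetical observations, chosen only for illustration.
data = [2.0, 3.0, 7.0, 12.0]

balance_point = mean(data)

# Net turning force at the mean, and at an arbitrary off-centre pivot.
net_at_mean = sum(signed_distances(data, balance_point))
net_off_centre = sum(signed_distances(data, 5.0))

print(balance_point, net_at_mean, net_off_centre)
```

With the pivot at the mean the net turning force is zero, so the see-saw balances; with the pivot moved elsewhere the see-saw tips toward the heavier side.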
Even when words are used instead of (or in addition to) formulae, applied mathematicians may still cause confusion by misusing technical terms within science or carefully defining them in ways that do not match their original usage. Wilkinson et al. (2007) have recently highlighted the most widespread example of this in phylogenetics, in which the word "clade" is applied to taxa on an unrooted tree (see, for example, Yang, 2006). To a biologist there is a world of difference between a rooted and an unrooted tree, since only the former specifies ancestor-descendant relationships, but many mathematicians seem to see little difference at all. This probably comes from their vulgar reduction of real historical relationships to artificial stick figures with leaf colorations.
As far as publishing mathematics is concerned, it has repeatedly been pointed out that it takes longer to get a mathematical paper published than any other type of research, with an average time from submission to publication of at least 2 years for most journals (Abt, 1992; Bradlow and Wainer, 1998; Carroll, 2001; Ellison, 2002; Greenberg et al., 2004). This has created the situation where mathematics journals are sometimes seen as archives of out-of-date work. This is completely unacceptable in bioinformatics, which needs to keep pace with modern biology. There seem to have been a number of responses by mathematicians to this situation. First, there is the simple expedient of publishing their work in a biological journal, which means that a large proportion of the journal's readership is disenfranchised by the language. Second, there is the approach of first publishing the work in an online technical report (with all of the details), followed by publication as an extended conference abstract (with none of the details), while waiting for the journal version to appear (with some of the details). If a biologist tried this form of multiple publication (and listed all three on their CV) they would be blacklisted. Third, many specialist bioinformatics journals now exist that try to publish in a timely manner. Unfortunately, the latter often have endless repeats of proofs that have been published elsewhere (how many times do we need to be shown that the least-squares sample mean is the maximum likelihood estimate of the population mean when one assumes a normal distribution?) or relegate the proofs to accessory/supplementary files on personal web pages (where they often have a short shelf life). I am not convinced that mathematics has been best served by any of these three options.
There is also a widespread failure of mathematicians to provide usable computer programs to go along with their algorithms. There is little point to developing a new algorithm if no one can actually use it in practice. There seem to be four distinct levels of response to this situation: (i) no program of any sort is provided; (ii) a command file is provided that can be used within interpreted data-analysis languages such as R/S+ or Python (with which most biologists are not familiar); (iii) a program is provided with a rudimentary user interface, sometimes compiled for a particular operating system, and sometimes provided as source code that can be compiled or used in an interpreted language such as Perl; and (iv) a program is provided with a carefully thought out user interface, either compiled for several operating systems or programmed in an interpreted language such as Java. Biologists thank their lucky stars for (iii) and especially (iv), but they curse severely for (ii) and especially (i).
Finally, it seems to be important to distinguish between mathematics and arithmetic, the latter being the application of actual numbers to the formulae developed in the former. We all do arithmetic, but only mathematicians (and computer scientists) do mathematics, and there is no reason to expect mathematicians to be any better at arithmetic than anyone else. After all, if they could do arithmetic then they would probably have become accountants instead, as this is a far more lucrative profession (and their conferences are usually held in Rio de Janeiro in summer rather than Oxford in winter). Therefore, attempting to understand a mathematical analysis by repeating the provided example is not always successful (see, for example, Steel et al., 2000).
Having considered all of these points, we are now in a position to provide the book review that I have been, by slow degrees, approaching. In light of the above, how does the book edited by Olivier Gascuel and Mike Steel fare? The short answer is: variably. A slightly longer answer is that on almost every criterion it ranges from one extreme to the other.
Publication of mathematical work in a book, rather than in a journal, does not usually imply up-to-the-minute research. Instead, the work is likely to provide summaries or overviews of particular areas of mathematical endeavor, as is the intended case here. As far as timing is concerned, the conference on which this volume is based was held in mid-2005, the editors' preface is dated late 2006, and the volume appeared in mid-2007. Thus, the chapters vary in age from 1 to 2 years (most of the chapters have literature references from 2006, and one even from 2007). The book is actually a companion volume to that edited by Gascuel (2005), which was based on presentations at the same conference series 2 years previously (i.e., 2003; see the review by Winkworth, 2006).
Some of the chapters in the new book are presented as a series of formal mathematical theorems and proofs (Grünewald and Huber; Semple), whereas some of the others make more sparse use of this formalism (Allman and Rhodes; Hartmann and Steel; Huson). The other chapters are communicated in words, and not to their detriment either, as indicated above. Interestingly, one chapter appears to be free of formulae (Sanderson et al.), and this also seems to be the chapter that most clearly has biologists as its intended readers.
It is thus not entirely clear who makes up the audience for this book. Some of the chapters are easily readable by biologists, and indeed any such biologists would presumably learn quite a lot from those chapters. However, other chapters diverge too much from the language that experimental scientists speak without, it seems to me, diverging enough to be thought-provoking for mathematicians and computer scientists in the same way that the previous volume achieved (see Winkworth, 2006). In this sense, the book may unfortunately fall between two stools.
Few of the authors make use of explicit biological examples (Gascuel and Guindon; Huson; Semple), and so there is little arithmetic to worry about. Fortunately, all of the authors realize the importance of figures, and most of the concepts are illustrated in some way. Admittedly, many of these figures are stick diagrams with little obvious biological content, thus emphasizing that mathematics is free from any necessary connection with the physical world. A directed acyclic connected line graph is not necessarily a phylogeny but may instead represent something else entirely. This conceptual generality of mathematics is both its greatest strength and its greatest weakness. Applied mathematics only works when it can be brought down to earth, and in phylogenetics that happens when stick diagrams can be turned into evolutionary histories. This does not happen in too many of the chapters here.
Only two of the chapters have extensive references to computer programs (Felsenstein; Huson). Not surprisingly, these two authors have spent much of their careers writing usable (and widely used) programs, and thus presumably appreciate the importance of actually implementing their ideas. Indeed, both of these authors have produced programs (Phylip and SplitsTree, respectively) that provide a wide range of analyses within the field of phylogenetic analysis, and so they probably have a broader perspective on the needs of biologists, as well.
In spite of the book's subtitle, little of what is presented here is actually new. The chapters vary from those that present new ideas (Sanderson et al.) to those that compile previously available proofs (Semple) to those that provide overviews of recently emerging topics (Hartmann and Steel; Mooers et al.; Rodrigo et al.). In this sense the book differs from its companion volume, which deliberately set out to provide comprehensive overviews of several important topics in phylogenetic analysis. This new volume gives the impression of being a much more eclectic collection of chapters. For example, three of the original conference presentations (by Daubin; Douzery; and Roest Crollius) that have previously been described as covering some of the "most exciting areas in contemporary biology" (Winkworth, 2006) are absent from this volume, along with half-a-dozen of the other presentations.
If I were going to make one general criticism of this book and its sister, it would be that "phylogenetic analysis" is seen as being entirely about constructing trees and/or networks. Homology assessment, on which such constructions must be based, plays almost no part in either book. Thus, recognizing homologous genomic fragments or aligning homologous sequence positions are rarely mentioned, in spite of the fact that these are at least as important as the array of topics that does receive extensive discussion. Moreover, the idea of simultaneously aligning sequences and building a tree is not mentioned at all.
Overall, this is an interesting collection without being an important one. There are parts of it that will appeal to particular people, depending on their specialty, but the coherence and the consistency of quality exhibited by its companion volume are missing. It should be bought by those people who have the sister volume, but it is unlikely to stand on its own. Moreover, biologists may not get as much out of it as they should.
References
Abt H. A. Publication practices in various sciences. Scientometrics (1992) 24:441–447.
Box J. F. R. A. Fisher: The life of a scientist (1978) New York: Wiley.
Bradlow E. T., Wainer H. Publication delays in statistics journals. Chance (1998) 11:42–45.
Brown J. R. Philosophy of mathematics: An introduction to the world of proofs and pictures (1999) London: Routledge.
Byers W. How mathematicians think: Using ambiguity, contradiction, and paradox to create mathematics (2007) Princeton, New Jersey: Princeton University Press.
Carroll R. J. Review times in statistical journals: Tilting at windmills? Biometrics (2001) 57:1–6.
Ellison G. The slowdown of the economics publishing process. J. Political Econ. (2002) 110:947–993.
Gascuel O., ed. Mathematics of phylogeny and evolution (2005) Oxford, UK: Oxford University Press.
Greenberg D., Rosen A. B., Olchanski N. V., Stone P. W., Nadai J., Neumann P. J. Delays in publication of cost utility analyses conducted alongside clinical trials: Registry analysis. BMJ (2004) 328:1536–1537.
Steel M., Huson D., Lockhart P. J. Invariable sites models and their use in phylogeny reconstruction. Syst. Biol. (2000) 49:225–232.
Wilkinson M., McInerney J. O., Hirt R. P., Foster P. G., Embley T. M. Of clades and clans: Terms for phylogenetic relationships in unrooted trees. Trends Ecol. Evol. (2007) 22:114–115.
Winkworth R. C. [Review of] "Mathematics of phylogeny and evolution." Syst. Biol. (2006) 55:532–533.
Yang Z. Computational and molecular evolution (2006) Oxford, UK: Oxford University Press.