© 2006 Society of Systematic Biologists
DNA Barcoding and Taxonomy in Diptera: A Tale of High Intraspecific Variability and Low Identification Success
Edited by Marshal Hedin: Associate Editor
Department of Biological Sciences, National University of Singapore 14 Science Drive 4, Singapore, 117543, Singapore E-mail: dbsmr{at}nus.edu.sg (R.M.)
Abstract
DNA barcoding and DNA taxonomy have recently been proposed as solutions to the crisis of taxonomy and have received significant attention from scientific journals, grant agencies, natural history museums, and mainstream media. Here, we test two key claims of molecular taxonomy using 1333 mitochondrial COI sequences for 449 species of Diptera. We investigate whether sequences can be used for species identification ("DNA barcoding") and find a relatively low success rate (< 70%) based on tree-based and newly proposed species identification criteria. Misidentifications are due to wide overlap between intra- and interspecific genetic variability, which causes 6.5% of all query sequences to have allospecific or a mixture of allo- and conspecific (3.6%) best-matching barcodes. Even when two COI sequences are identical, there is a 6% chance that they belong to different species. We also find that 21% of all species lack unique barcodes when consensus sequences of all conspecific sequences are used. Lastly, we test whether DNA sequences yield an unambiguous species-level taxonomy when sequence profiles are assembled based on pairwise distance thresholds. We find many sequence triplets for which two of the three pairwise distances remain below the threshold, whereas the third exceeds it; i.e., it is impossible to consistently delimit species based on pairwise distances. Furthermore, for species profiles based on a 3% threshold, only 47% of all profiles are consistent with currently accepted species limits, 20% contain more than one species, and 33% contain only some of the sequences from one species; i.e., adopting such a DNA taxonomy would require the redescription of a large proportion of the known species, thus worsening the taxonomic impediment. We conclude with an outlook on the prospects of obtaining complete barcode databases and the future use of DNA sequences in a modern integrative taxonomy.
Keywords: Cytochrome oxidase I; genetic distance; pairwise distance; species identification
Received September 6, 2005; Revised December 9, 2005; Accepted April 7, 2006
Species description and identification are among the most important tasks in biology, because biologists can neither report empirical results nor access published information on a study organism until it is correctly named and/or identified. Descriptive taxonomy started in earnest in the 18th century and after 250 years, 1.5 to 1.8 million species have been described with an estimated 5 to 100 million species awaiting discovery and/or description (Wilson, 2003). Not surprisingly, taxonomic character sources have changed over the past 250 years. However, new techniques have been largely unable to prevent a much lamented crisis of taxonomy, which is now itself an endangered species (Godfray and Knapp, 2004; Wheeler, 2004). One of the most important problems for the future of alpha taxonomy is the slow rate at which taxonomists can revise, describe, and identify species. For example, it has been estimated that another 940 years will be required before all species are described using traditional techniques (Seberg, 2004). Not surprisingly, biologists are thus looking for alternatives that can speed up the process (Hogg and Hebert, 2004). Two DNA sequence-based approaches are currently receiving much attention. According to their supporters, the approaches have the potential to improve the prospects of alpha-taxonomical research (Hebert et al., 2003a; Tautz et al., 2003).
First, Hebert et al. (2003a) proposed an identification system for specimens based on "DNA barcodes." The "Barcoding Life" consortium (Hebert et al., 2004a) proposes to sequence approximately 650 bp of the mitochondrial COI gene for each species and argues that these sequences can be used for species identification. In a barcoded world, an unidentified specimen will be determined based on its COI sequence, which is matched to an identified DNA barcode from a publicly available database. Second, Tautz et al. (2002, 2003) envision a "DNA taxonomy" in which DNA sequences will ultimately provide the main scaffold for a species-level taxonomy. Examples of DNA-based taxonomy can already be found in numerous papers that use genetic distances for revising species limits (e.g., Chu et al., 1999, 2003; Shih et al., 2004; Sun et al., 2003; Tang et al., 2000, 2003; Zhi et al., 1996). DNA barcoding and DNA taxonomy have attracted much attention and discussion (e.g., Baker et al., 2003; Barrett and Hebert, 2005; Blaxter and Floyd, 2003; Dunn, 2003; Ebach and Holdrege, 2005; Ferguson, 2002; Hebert and Barrett, 2005; Hebert and Gregory, 2005; Hebert et al., 2003a, 2003b, 2004a, 2004b; Janzen, 2004; Lee, 2004; Lipscomb et al., 2003; Marshall, 2005; Moritz and Cicero, 2004; Pennisi, 2003; Prendini, 2005; Seberg, 2004; Seberg et al., 2003; Tautz et al., 2002, 2003; Will and Rubinoff, 2004; Will et al., 2005). However, this discussion initially focused mostly on theoretical issues, whereas we here test empirically whether, regardless of all theoretical challenges, DNA taxonomy and DNA barcoding can deliver reliable species-level classifications and/or species identifications.
For this purpose, we are using a data set composed of 1333 COI sequences for 449 species of Diptera, 127 of which are represented by multiple sequences. For testing DNA barcoding, we pretend for each sequence that we do not know its species identity. We then identify the query using different identification criteria (see below) and count how many identifications yield the name recorded in GenBank. For testing the feasibility of a DNA taxonomy based on pairwise distances, we assemble DNA profiles and assess whether these profiles correspond to currently accepted species.
DNA Taxonomy
Tautz et al.'s (2002, 2003) proposal of a DNA taxonomy based on sequences has received less attention than Hebert's work on DNA barcodes. This is conceivably due to the lack of detail on how a DNA taxonomy using sequences as a "scaffold" can be constructed. Simplistic approaches using pairwise distances were rejected by the authors and synthetic approaches utilizing morphological and molecular data espoused (Tautz et al., 2003). However, it remained unclear how these data could be meaningfully combined and how taxonomists would resolve conflict between data partitions. Meanwhile, many taxonomic publications using DNA sequences as the main scaffold have been published. However, these approaches using "molecular operational taxonomic units" (Blaxter, 2003; Blaxter and Floyd, 2003; Floyd et al., 2002) generally use pairwise distances for clustering sequences by similarity, although the choice of threshold value for distinguishing intra- from interspecific distances is largely arbitrary (DeSalle et al., 2005; Ferguson, 2002; Will and Rubinoff, 2004).
One additional and largely neglected problem is that the use of pairwise distances or other similarity measures can lead to logically inconsistent results. Three sequences can have two pairwise distances conforming to and one exceeding a given threshold (Fig. 1). Under this circumstance, it remains unclear whether all three sequences should be included in the same DNA species given that one distance is violating the threshold. We are here using our empirical data to test how frequently this situation is encountered in a real data set. We also assess, based on different thresholds, whether the limits of "DNA species" based on distances correspond to the currently accepted species limits.
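The inconsistency can be illustrated with a minimal Python sketch. The sequence names and distance values below are hypothetical and serve only to show the logic; the function is not part of any published tool, and treating a distance equal to the threshold as conforming is an assumption.

```python
from itertools import combinations

def inconsistent_triplets(dist, threshold):
    """Return all triplets in which exactly two pairwise distances are at or
    below the threshold while the third distance exceeds it."""
    names = sorted({name for pair in dist for name in pair})

    def d(a, b):
        # Distances are stored once per pair; look up either orientation.
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    hits = []
    for a, b, c in combinations(names, 3):
        distances = [d(a, b), d(a, c), d(b, c)]
        if sum(x <= threshold for x in distances) == 2:
            hits.append((a, b, c))
    return hits

# Hypothetical uncorrected distances (in percent) among three sequences.
example = {("A", "B"): 2.0, ("B", "C"): 2.5, ("A", "C"): 4.2}
print(inconsistent_triplets(example, threshold=3.0))
```

With a 3% threshold, A and B as well as B and C would be grouped together, yet A and C differ by more than the threshold, so the triplet is reported.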
DNA Barcoding
Hebert et al. (2003a) proposed that unidentified specimens can be reliably identified to species based on DNA sequences (DNA barcodes). This proposal is eminently testable and several tests have been carried out (e.g., Barrett and Hebert, 2005; Hebert et al., 2003a, 2003b, 2004b; Hogg and Hebert, 2004; Meyer and Paulay, 2005; Ward et al., 2005). However, most employed data sets that included few or no species with multiple sequences and/or few closely related species (Moritz and Cicero, 2004; Prendini, 2005; Sperling, 2003; Will and Rubinoff, 2004). It therefore remains unclear whether intraspecific and interspecific variability are sufficiently distinct for DNA barcoding to be a viable technique (Seberg, 2004; Stoeckle, 2003). Identification was furthermore based on barcodes derived from individual organisms instead of using the information from all known sequences for a given species. We are here testing whether species retain unique barcodes when we use the consensus sequence for all available conspecific sequences. Lastly, the accuracy of the tree-based species identification techniques that were utilized by Hebert et al. (2003a) has been questioned (DeSalle et al., 2005; Prendini, 2005; Will and Rubinoff, 2004; Will et al., 2005). We here address all these issues by using a data set rich in closely related sequences (see above), testing the uniqueness of consensus species barcodes, and establishing identification success based on several different identification techniques.
Tree-Based Identification Techniques
Techniques for identifying sequences of unknown provenance are still in their infancy and generally fall into two categories. Most studies use tree-based identification tools based on neighbor-joining trees (Barrett and Hebert, 2005; Floyd et al., 2002; Hebert et al., 2003a; Tautz et al., 2003). Queries are considered successfully identified when they cluster with conspecific barcodes. There are several problems with this approach (Will and Rubinoff, 2004). Imagine a query clustering with a chimp barcode. Based on the query's position, one cannot decide whether it comes from Homo sapiens or another chimp; i.e., forming a cluster on a tree is logically insufficient for identifying a sequence (Will and Rubinoff, 2004). Yet, Barrett and Hebert (2005), Hebert et al. (2003a), and Hebert et al. (2004b) have consistently used clustering as an indication for identification success. We would argue that at best only those queries can be identified whose position unambiguously allows for the assignment of one species name to the query sequence (Will and Rubinoff, 2004). These queries are found at least one node into a clade consisting of only sequences from the same species or are part of a polytomy formed by conspecific barcodes (see Table 1).
A second problem with tree-based identification as currently practiced is the reliance on neighbor-joining trees (Prendini, 2005; Will and Rubinoff, 2004). Data ambiguity is here difficult to detect, because most neighbor-joining algorithms only generate a single tree even if other trees have the same fit to the data ("tie trees"; see Takezaki, 1998). Tree choice then becomes dependent on taxon entry order, which is hardly acceptable for a technique used in science (Backeljau et al., 1996). Simulation studies have furthermore revealed that "tie trees" are particularly common in trees with short internal branches (Takezaki, 1998) and these can be expected to be common on identification trees given that they usually have multiple sequences from the same or closely related species. We are here addressing these concerns by not only using neighbor-joining, but also parsimony and Bayesian analyses for reconstructing identification trees. Lastly, there are a host of conceptual problems with using trees for identification (Will and Rubinoff, 2004). For example, the "clustering" requirement assumes that species should be monophyletic and not only ignores the literature that questions the theoretical justification for this requirement (Wheeler and Meier, 2000), but also the substantial empirical evidence indicating that a large proportion of currently recognized species are "paraphyletic" on gene trees (Crisp and Chandler, 1996; Funk and Omland, 2003).
Alternative Identification Criteria
As an alternative to tree-based identification, we also use identifications based on direct sequence comparison, where a wide variety of techniques and metrics could be used and a number of new criteria and algorithms have recently been proposed (Blaxter et al., 2005; DeSalle et al., 2005; Pozhitkov and Tautz, 2002; Steinke et al., 2005). We here use three criteria (Table 1). Our first and least stringent identification criterion is "best match." Any query is assigned the species name of its best-matching barcode regardless of how similar the query and barcode sequences are. Obviously, under this criterion misidentifications are common and even unavoidable for all queries that belong to species without conspecific barcodes in the database (Will and Rubinoff, 2004). For example, the first Homo query will be most similar to a great ape barcode and thus automatically identified as such. Such mistakes are avoided by using our second identification strategy, "best close match." Here, we first identify the best barcode match of a query, but then only assign the species name of that barcode to the query if the barcode is sufficiently similar. In all other cases the query remains unidentified.
This strategy requires a threshold similarity value that defines how similar a barcode match needs to be before it can be identified. This value can be estimated for a given data set by obtaining a frequency distribution of all intraspecific pairwise distances and determining the threshold distance below which 95% of all intraspecific distances are found. If, for example, 95% of conspecific sequences have pairwise distances below 1%, then a query can only be identified according to the best close match criterion if the query has a match in the barcode data set that falls into the 0% to 1% interval. All queries without such a match would remain unidentified. The main practical drawback of this approach is that a given data set may not be representative for the taxon under investigation. For example, a very biased sample of conspecific and/or congeneric sequences could push the threshold value up or down. It also remains unclear why one should believe that there would be a common threshold across species. However, in the absence of better DNA-based identification approaches, this technique at least provides a rigorously derived threshold value.
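The threshold estimate amounts to reading a percentile off the sorted intraspecific distances. The following Python sketch shows one way to do this; the function name and the example distances are hypothetical, and the tie-handling convention (the smallest observed distance that covers at least 95% of the values) is an assumption rather than a description of TaxonDNA.

```python
def intraspecific_threshold(distances, percent=95):
    """Smallest observed distance d such that at least `percent` percent of
    all intraspecific distances are less than or equal to d."""
    if not distances:
        raise ValueError("need at least one intraspecific distance")
    ordered = sorted(distances)
    # Integer ceiling of percent * n / 100 avoids floating-point rounding.
    k = (percent * len(ordered) + 99) // 100
    return ordered[k - 1]

# Hypothetical intraspecific distances in percent (20 values).
intra = [0.0, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7,
         0.8, 0.9, 1.0, 1.1, 1.3, 1.5, 1.8, 2.2, 3.1, 7.4]
print(intraspecific_threshold(intra))
```

In this toy example 19 of the 20 distances (95%) are at or below 3.1%, so a query would need a barcode match within 3.1% to be identified under best close match. The single outlier at 7.4% illustrates how a few divergent conspecific sequences are excluded from the estimate.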
Our last criterion for identifying queries is a more rigorous application of the best close match strategy. Here we utilize information from all conspecific barcodes in the database instead of just focusing on the barcode that is most similar to the query. Imagine that the closest match for a query is a list of all known sequences for a single species. Under this circumstance, the identifier will be more confident about assigning this species name to the query than in cases where multiple species names are found on the list of best matches. Indeed, a conservative identifier would probably only assign a species name if the query is followed by all known barcodes for a particular species and insist that there are at least two conspecific matches.
Use of GenBank Data
Similar to other tests of DNA barcoding and DNA taxonomy (e.g., Barrett and Hebert, 2005; Hebert et al., 2003b), we are here utilizing sequences from GenBank, although it is known to include misidentified sequences (e.g., Harris, 2003; Hebert et al., 2003b; Ruedas et al., 2000; Seberg, 2004; Vilgalys, 2003). Could it be that our study is thus underestimating the potential of DNA barcoding? For several reasons, we believe this is not the case. First, any future barcode database will be similar to GenBank in that many researchers will submit sequences; i.e., misidentified sequences in the barcode database should be as common as they are in GenBank (Seberg, 2004). Even when better vouchering policies are implemented, it is unrealistic to believe that most vouchers will ever be reidentified by taxonomic experts and therefore only the most glaring mistakes may be found. Second, it has been proposed that DNA barcodes should be generated based on museum specimens and existing tissues in cryo-collections, although the former contain many misidentified specimens (e.g., Meier and Dikow, 2004) and the latter frequently have poor vouchering policies; i.e., DNA barcode databases based on these sources will again be similar to GenBank in containing a significant number of misidentified sequences. Third, we carefully inspected the structure of our database and found that the sequences that were misidentified using DNA barcodes had been submitted by researchers that on average provided nine conspecific and 36 congeneric sequences (Appendix 1; available online at the Society of Systematic Biology Web site, http://systematicbiology.org); i.e., the misidentified sequences came from laboratories that were carrying out phylogeographic studies. Such laboratories were probably as careful about identifying their specimens as can be expected from any future barcoder. 
However, authors of phylogeographic studies have yet to settle on a policy of how to submit sequences from specimens that yield unusually high intraspecific distances. Some researchers submit under the name of the original specimen identification (e.g., Frost et al., 1998), some indicate in their submission that such unusually divergent sequence comes from a closely related species (e.g., by using "cf" or "aff": Stahls et al., 2004), whereas others submit the sequence only identified to genus (e.g., Pons and Vogler, 2005). It is important to remember that data sets containing a large number of sequences from putatively cryptic species submitted under the original and potentially incorrect name could have a negative effect on the performance of DNA barcodes.
Regardless of these drawbacks of GenBank data, we believe that sequences from this database currently provide the best test databases for molecular taxonomy because only GenBank can provide data that are rich in congeneric sequences and have submission profiles similar to those of future barcode databases. At the very least, they allow us to test to what degree DNA-based identification and identification based on traditional tools will come to congruent results.
Materials and Methods
Sequences, Alignment, and Pairwise Distances
Similar to the approach of Barrett and Hebert (2005) and Hebert et al. (2003b), we used 1443 sequences from GenBank and aligned them using ClustalX. We eliminated sequences that (1) were too short (< 300 bp, 57 sequences), (2) were not identified to species (49 sequences), (3) came from species-hybridization experiments (Drosophila subquinaria GI:25990046), or (4) could not be aligned and/or translated into proteins or had > 30% sequence divergence to all other COI sequences for Diptera (Dyscritomyia robusta GI:19879668; Drosophila busckii GI:27657151; Drosophila affinis GI:27657153). Sequences with GenBank names that are synonyms according to the Biosystematic Database of World Diptera (Thompson, 2005) were renamed using the valid name (Anopheles arabiensis is a junior synonym of Anopheles gambiae; 3 sequences). The remaining 1333 COI sequences came from 449 species of Diptera, of which 127 species were represented by 1011 sequences (see supplementary material; available online at http://systematicbiology.org). Sequences not belonging to COI were removed and misplaced gaps were corrected to yield a 1539-bp gap-free alignment lacking stop codons (see supplementary material; available online at http://systematicbiology.org). Most analyses were carried out using a program developed for this purpose ("TaxonDNA" available at http://taxondna.sf.net/). We carried out separate analyses for all sequences with a minimum of 300, 400, 500, and 600 bp overlap. Due to large interspecific distances in the very speciose genus Drosophila, we treated the Drosophila subgenera as separate genera.
In order to test for overlap between intraspecific with interspecific genetic variability, we plotted all uncorrected pairwise distances for conspecific sequences and all distances for interspecific, congeneric sequences. In order to test whether all species have unique DNA barcodes, we first tested whether identical sequences were shared by individuals from different species. We then constructed species barcodes as the consensus sequence of all conspecific sequences and again tested for the uniqueness of these species barcodes. The intraspecific sequence variability was summarized using IUPAC codes and the consensus sequence had to be based on at least two sequences.
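The two basic operations described here, computing uncorrected pairwise distances and building a consensus barcode with IUPAC ambiguity codes, can be sketched in Python as follows. This is an illustration under stated assumptions, not the TaxonDNA implementation: the function names are hypothetical, sequences are assumed to be aligned to equal length, and sites with gaps or ambiguous bases are simply skipped.

```python
# IUPAC ambiguity codes keyed by the set of bases observed at a site.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def p_distance(seq1, seq2):
    """Uncorrected distance: proportion of differing sites among aligned
    positions at which both sequences carry an unambiguous base."""
    compared = differing = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            differing += a != b
    if compared == 0:
        raise ValueError("sequences do not overlap")
    return differing / compared

def consensus_barcode(seqs):
    """Consensus of aligned conspecific sequences; variable sites are
    summarized with IUPAC codes. At least two sequences are required."""
    if len(seqs) < 2:
        raise ValueError("consensus requires at least two sequences")
    out = []
    for column in zip(*(s.upper() for s in seqs)):
        bases = frozenset(b for b in column if b in "ACGT")
        out.append(IUPAC[bases] if bases else "-")
    return "".join(out)

print(p_distance("ACGTACGT", "ACGTACGA"))
print(consensus_barcode(["ACGT", "ACAT", "ACGT"]))
```

In the toy consensus, the third site is variable (G or A) and is therefore encoded as R.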
DNA Barcoding: Tree-Based Query Identification
Using PAUP* (Swofford, 2002; Kimura 2-parameter model as recommended in Barrett and Hebert, 2005; ties broken randomly), we computed neighbor-joining trees and bootstrap trees for the largest sets of congeneric sequences with at least 300 bp overlap. The same data sets were analyzed using parsimony as implemented in TNT (Goloboff et al., 2003; New Technology Search = 15; find min. length = 3; bootstrap 250 replicates) and Bayesian analyses as implemented in MrBayes 3.1 (Huelsenbeck and Ronquist, 2003). All Bayesian searches were initiated from random starting trees. For all data sets with congeneric sequences, the GTR+I+G model was favored by the Akaike information criterion and hierarchical likelihood-ratio testing as implemented in MrModelTest version 2.2 (Nylander, 2004). The data set was run for 3,000,000 generations and a tree was sampled every 300 generations, resulting in 10,000 trees. Chain stationarity had been achieved for all data sets after 1,200,000 generations (burn-in) and 4000 trees were subsequently discarded. Three independently repeated analyses resulted in similar tree topologies and comparable clade probabilities and substitution model parameters.
For all trees, identification success was initially assessed as described in Hebert et al. (2003a) and Table 1; i.e., sequences were considered successfully identified as long as they formed species-specific clusters. Species with sequences at multiple positions in the tree were considered failures and those species with a single sequence were counted as ambiguous. Second, we used the revised identification criteria described in the introduction. We only considered queries to be correctly identified if they were found in a species-specific polytomy or at least one node into a clade exclusively consisting of sequences from one species. Ambiguous were all queries belonging to species with one or two sequences and those that formed a sister group to a cluster of conspecific sequences. We counted those sequences as misidentified that were assigned a definite but incorrect name (e.g., a query within an allospecific sequence cluster). A special case is polytomies of sequences from two species. If the query was from a different species than all remaining sequences in the polytomy, we counted the query as a misidentification because an identifier will assume that the query is conspecific with the remaining sequences. However, if the polytomy had at least two sequences each from two different species, then the query in the polytomy was considered ambiguous, because the identifier will be aware that a query in such a polytomy cannot be unambiguously identified.
DNA Barcoding: Identifying Species Based on Distances (see Table 1)
- "Best match." We used TaxonDNA to find for each query its closest barcode match. If both sequences were from the same species, the identification was considered a success, whereas mismatched names were counted as failures. Several equally good best matches from different species were considered ambiguous.
- "Best close match." We used TaxonDNA to plot the relative frequency of intraspecific distances in order to determine the threshold value below which 95% of all intraspecific distances are found. All queries without a barcode match below the threshold value remained unidentified. For the remaining queries, their identity was compared to the species identity of their closest barcode. If the name was identical, the query was considered an identification success. The identification was considered a failure when the names were mismatched and considered ambiguous when several equally good best matches were found that belonged to a minimum of two species.
- "All species barcodes." We assembled for each query a list of all barcodes sorted by similarity to the query using the same threshold as for best close match. Queries were considered a success when they were followed by all conspecific barcodes as long as there were at least two barcodes for the species. Queries were considered ambiguous when they were followed by only one conspecific barcode or only some of the conspecific sequences. Queries followed by all conspecific sequences from the "wrong" species were considered misidentified.
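The first two criteria can be sketched in Python as follows. This is a simplified illustration rather than the TaxonDNA code: the function names, the toy sequences, and the use of a plain mismatch proportion on equal-length sequences are assumptions, and the caller is expected to exclude the query itself from the barcode library.

```python
def p_distance(a, b):
    """Uncorrected distance between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def closest(query, library):
    """Return the smallest distance to any barcode in the library and the
    set of species names found at that distance."""
    scored = [(p_distance(query, seq), species) for species, seq in library]
    best = min(d for d, _ in scored)
    return best, {species for d, species in scored if d == best}

def best_match(query, true_species, library):
    """Assign the name of the closest barcode, however distant it is."""
    _, names = closest(query, library)
    if len(names) > 1:
        return "ambiguous"  # equally good matches from several species
    return "success" if names == {true_species} else "failure"

def best_close_match(query, true_species, library, threshold):
    """As best match, but only if the closest barcode is within threshold."""
    best, _ = closest(query, library)
    if best > threshold:
        return "unidentified"
    return best_match(query, true_species, library)

# Hypothetical barcode library of (species name, aligned sequence) pairs.
library = [("sp1", "AAAAAAAAAA"), ("sp1", "AAAAAAAAAT"), ("sp2", "AAAAATTTTT")]
query = "AAAAAAAATT"
print(best_match(query, "sp1", library))
print(best_close_match(query, "sp1", library, threshold=0.03))
```

In this toy case the closest barcode is conspecific but differs by 10%, so best match counts a success, whereas best close match with a 3% threshold leaves the query unidentified. Both outcomes are scored against the name recorded for the query, as in our tests.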
DNA Taxonomy: Profiles Based on Distance Thresholds
We tested the viability of threshold values for distinguishing intra- from interspecific variability for 2%, 3%, 4%, 5%, and 6% thresholds. For this purpose, TaxonDNA finds for each query a set of barcodes for which each sequence in the set has at least one other sequence within the threshold distance. For all clusters, we determined whether the largest observed distance exceeded the threshold distance and whether they correspond to currently accepted species (= contains all sequences for one species). If not, we determined whether it contained sequences for several species (error 1) and/or not all sequences for the same species (error 2).
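The profile assembly can be read as single-linkage clustering, in which a sequence joins a cluster as soon as it lies within the threshold of any member. The Python sketch below follows that reading; it is an interpretation of the procedure rather than the TaxonDNA implementation, and all names and toy sequences are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Uncorrected distance between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def threshold_clusters(records, threshold):
    """Single-linkage clusters of record indices: two records share a cluster
    if a chain of pairwise distances <= threshold connects them."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if p_distance(records[i][1], records[j][1]) <= threshold:
            parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def evaluate(records, threshold):
    """Report, per cluster, the largest internal distance and both errors."""
    totals = {}
    for species, _ in records:
        totals[species] = totals.get(species, 0) + 1
    report = []
    for members in threshold_clusters(records, threshold):
        names = {records[i][0] for i in members}
        max_d = max((p_distance(records[i][1], records[j][1])
                     for i, j in combinations(members, 2)), default=0.0)
        report.append({
            "members": members,
            "max_distance": max_d,
            "exceeds_threshold": max_d > threshold,
            "error1": len(names) > 1,  # several species in one cluster
            "error2": any(sum(records[i][0] == sp for i in members) < totals[sp]
                          for sp in names),  # species split across clusters
        })
    return report

# Hypothetical 100-bp sequences: B is 2% from A, C is 2% from B but 4% from A.
records = [("sp1", "A" * 100), ("sp1", "TT" + "A" * 98),
           ("sp2", "TTTT" + "A" * 96), ("sp2", "A" * 50 + "T" * 50)]
for row in evaluate(records, threshold=0.03):
    print(row)
```

With a 3% threshold the first three sequences are chained into one cluster whose largest internal distance (4%) exceeds the threshold; the cluster mixes two species (error 1) and contains only part of the second species (error 2).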
Results
Intra- and Interspecific Variabilities
The following results apply to sequences with 300 bp overlap. Corresponding results for the subset of sequences with 400, 500, and 600 bp overlap are found in Tables 2 and 3 and document that identification success is largely unrelated to sequence overlap. We find that intraspecific and interspecific variation overlap widely in Diptera COI sequences (0% to 15.5%) with 99% of all congeneric distances falling into this interval (Table 2, Fig. 2). When the largest 5% of the intraspecific and the lowest 5% of the interspecific values are excluded, the overlap shrinks to 2.31% to 3.34% which holds 5.24% of all congeneric distances (Table 2). The pattern of intra- versus interspecific variability is very inconsistent across genera, with some essentially lacking overlap whereas others have little separation. Species barcodes sensu stricto constructed as the union-based consensus of all conspecific barcodes revealed that 34 species are involved in 22 cases of two species sharing identical barcodes (Appendix 2; available online at http://systematicbiology.org). Twenty-five of these species belong to the 117 species in the data set with multiple sequences that have a minimum overlap of 300 bp; i.e., when multiple individuals are sequenced 21% of the species lack unique barcodes.
Success of Tree-Based DNA Identification Techniques
According to Hebert et al.'s (2003a) criteria and depending on tree reconstruction techniques (NJ, parsimony, Bayesian), only 40% to 47% of all queries representing 22% to 23% of all species are successfully identified (Table 4; supplementary material, available online at http://systematicbiology.org). Many sequences (37% to 43%) and species (15% to 16%) fail to form species-specific clusters and are misidentified. Due to the lack of conspecific sequences in the data set, a large proportion of species (62%) and sequences (16%) are ambiguous and remain unidentified. The highest identification success and lowest misidentification rate are observed for neighbor-joining trees followed by Bayesian and parsimony trees. Compared to the first set of tree-based identification criteria, the second set based on the revised rules described in the Introduction has a higher success rate (60% to 63%), whereas the proportion of ambiguously placed sequences remains high (39%). Misidentifications are rare using both criteria (1.2%; Table 4). Note that the success rate for sequences is higher under the latter criterion, because it does not require monophyly; i.e., conspecific sequences found in two different clades on the tree can be identified as long as the name assignment is unambiguous. Under Hebert et al.'s (2003a) criterion, all sequences for the species would fail because it is not "monophyletic."
Success of Similarity-Based DNA Identification Techniques (see supplementary material, available online at: http://systematicbiology.org)
Success under "best match" is 67.7%, 79 queries are ambiguous (5.3%), and 361 are misidentified (Table 3; 27.1%). The data set contains 588 sequences whose best match is an identical sequence. Of these, 35 have an allospecific identical match (6.0%; see Table 2 for corresponding results with larger sequence overlap).
In order to be able to use best close match, we first determined that 95% of all intraspecific distances fall into the interval from 0% to 3.34%; i.e., the latter value was used to decide whether a query had a close enough barcode match for identification. Success under best close match is 66.3%, 5.9% of all sequences are misidentified, and the remaining queries (26.4%) remain unidentified because they have no match below 3.34% (Table 3).
For the "all species barcodes" criterion the success rate is 40.6%, in 23.0% there is no match below the threshold, for 35.2% the result is ambiguous (< 2 conspecific barcodes or mixture of con- and allospecific barcodes top the list), and 1.2% of all queries are misidentified.
DNA Taxonomy: Profiles Based on Distance Thresholds
Because the distances between three sequences do not have to be equilateral, a fixed threshold value cannot be maintained. The following increases in cutoff values are observed within the respective sets: 2% → 4.8%; 3% → 4.8%; 4% → 6.1%; 5% → 8.5% (Table 5). All cutoff values lead to species-level classifications that are radically different from accepted species limits. For example, for the popular 3% threshold, only 47% of the clusters agree with traditional species and one cluster contains eight species. Corresponding results for other thresholds are found in Table 5.
Discussion
Many biologists have argued that the future of descriptive taxonomy will depend on successfully embracing new techniques. Many ideas have been proposed (Godfray, 2002) and much progress has been achieved by, for example, digitizing taxonomic information, using new microscopic techniques, and developing interactive identification tools (De Ley et al., 2005; Dunn, 2003; Klaus et al., 2003; Marshall, 2003; Prendini, 2005; Sperling, 2003; Thacker, 2003; Wheeler, 2004; Will et al., 2005). It is only natural that the discussion has now turned to the contribution of DNA sequences. Uncontroversial is their great promise for associating morphologically disparate life history stages, associating males and females, solving species limits for polymorphic and cryptic species, and identifying artefacts made of materials derived from endangered species (Birstein et al., 1998; Hebert et al., 2003a; Marshall, 2003; Paquin and Hedin, 2004; Palumbi and Cipriano, 1998; Tautz et al., 2003). However, controversial is the extent to which DNA sequences and biologists only versed in molecular techniques should and can replace the morphological tools and retiring traditional taxonomists. Most systematists would argue that a more straightforward solution to employing molecular techniques is hiring new taxonomists (Lee, 2004; Minelli, 2003; Will and Rubinoff, 2004).
Intra- and Interspecific Variabilities
Much of the literature on DNA barcoding and DNA taxonomy consists of theoretical arguments in favor of and against the use of DNA sequences in taxonomy. We believe that these arguments are necessary and useful, but we also believe that more empirical information is needed. One issue that has been discussed from a theoretical point of view but only partially addressed with empirical data (but see Avise, 2000; Avise and Walker, 1999; Johns and Avise, 1998) is the overlap between intraspecific and interspecific genetic variabilities (Stoeckle, 2003; Ward et al., 2005; Will and Rubinoff, 2004). We are here revealing that the overlap is extensive in Diptera (0% to 15.5%) and that many of the pairwise distances for congeneric sequences fall into the area of overlap (99%; Fig. 2). The overlap remains considerable even when the extreme 5% of all intra- and interspecific distances are removed (2.31% to 3.34%: 5.2% of distances). These results are not unexpected given that numerous phylogeographic studies have repeatedly contradicted Hebert's contention that "the gene's [COI] sequence doesn't appear to vary among individuals of the same species" (see Pennisi, 2003). Given the variability of COI, it is obviously insufficient to generate a single DNA barcode per species. Instead, several individuals from different parts of a species' range have to be included in serious DNA barcoding projects. Our results also highlight that the "barcode" analogy between species barcodes and the kinds of barcodes used in industry is problematic given that variable barcodes would never be tolerated in the product barcodes used for commercial purposes (Lee, 2004; Moritz and Cicero, 2004).
Also frequently discussed in the theoretical literature is the problem of identical barcodes shared by several species (Ferguson, 2002; Floyd et al., 2002; Quicke, 2004; Tautz et al., 2003). The existence of shared barcodes is beyond doubt, but for a technique like DNA barcoding, it is more important to know how common they are. We find that the incidence at the level of individual barcodes is moderate. Of the queries with identical barcode matches, only 6% share their barcode with a sequence from another species. However, this result nevertheless implies that it is impossible at the 5% level to support the intuitive conclusion that identical barcodes also come from the same species.
We find it more worrisome that there are so many species with identical consensus barcodes (21% of species with multiple sequences). Consensus sequences have numerous drawbacks in the context of phylogenetic reconstruction and study of molecular evolution (e.g., Page and Holmes, 1998), but they are of interest in evaluating the performance of DNA barcodes for diagnostic purposes. Ideally, a character used for species identification and description should be diagnostic in the sense that it unambiguously distinguishes all individuals of one species from the individuals of all other species (e.g., DeSalle et al., 2005; Panchen 1992: "monothetic" versus "polythetic" species; Winston, 1999). In traditional taxonomy as well as species described based on DNA sequence evidence, such characters are provided in the differential diagnosis for the new species (Bond, 2004; Bond and Sierwald, 2003; Winston, 1999). Our results suggest that COI is not a very good diagnostic tool in Diptera because 21% of all species for which multiple sequences are available have identical consensus sequences. Our finding does not imply that COI cannot be used for species identification, but for many species determinations will have to rely on indirect techniques such as tree-based assignment of specimens to sequence clusters or probability-based statements about the presence of nucleotides at particular sites. The lack of species-specific barcodes also bodes poorly for attempts to develop short oligonucleotide probes for species identification (Gibbs et al., 2005; Summerbell et al., 2005). Several authors had envisioned microarray or DNA chip approaches to species identification, but no probe will be able to distinguish between species that share COI consensus barcodes. We believe that the high frequency of shared consensus barcodes indicates that DNA barcoding will ultimately only be useful for identifying species to species groups. 
For many applied purposes of species identification (e.g., biosecurity: Armstrong and Ball, 2005; identification of medically important fungi: Summerbell et al., 2005), this will be satisfactory, but the real challenge in taxonomy is to find new and better techniques for delimiting and identifying closely related species (Sperling, 2003).
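A check for shared consensus barcodes can be sketched as follows. How the consensus is built is an assumption here: each alignment column is summarized as the set of bases observed in that species (comparable to an ambiguity-coded consensus), and two species count as sharing a barcode only when all columns agree. The function names and toy records are hypothetical.

```python
from collections import defaultdict

def consensus(seqs):
    """Summarize aligned conspecific sequences column by column.

    Each column becomes the set of bases observed in that species;
    gaps and N are ignored. This is one possible consensus definition.
    """
    return tuple(frozenset(col) - {"-", "N"} for col in zip(*seqs))

def species_sharing_consensus(records):
    """records: list of (species, aligned_sequence) tuples.

    Returns the sorted names of species whose consensus is identical to
    that of at least one other species. Only species represented by more
    than one sequence are considered, as in the text.
    """
    by_species = defaultdict(list)
    for species, seq in records:
        by_species[species].append(seq)

    by_consensus = defaultdict(list)
    for species, seqs in by_species.items():
        if len(seqs) > 1:
            by_consensus[consensus(seqs)].append(species)

    return sorted(sp for group in by_consensus.values()
                  if len(group) > 1 for sp in group)

# Hypothetical toy data: sp_a and sp_b vary at the same site in the same way.
toy_records = [
    ("sp_a", "ACGT"), ("sp_a", "ACGA"),
    ("sp_b", "ACGA"), ("sp_b", "ACGT"),
    ("sp_c", "ACGG"), ("sp_c", "ACGG"),
]
print(species_sharing_consensus(toy_records))
```

Under this definition no probe targeting any single site could separate sp_a from sp_b, which is the difficulty the paragraph above raises for oligonucleotide-based identification.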
Identification Success Rates
Overall, we find that for our data set the identification success rates for DNA barcoding are unacceptably low and never exceed 70% (Fig. 3). Furthermore, this low success rate does not improve even if sequences with greater sequence overlap are used (Tables 2 and 3). However, larger data sets will be needed to rigorously test this prediction. In Diptera, identification success actually declines with increasing sequence overlap, which we believe to be due to the relatively small number of sequences with more than 500 base pairs of overlap. Whether these success rates are high enough to justify the considerable expense for barcoding all species and obtaining sequences for unidentified specimens remains a judgement call for the user. However, we doubt that DNA barcoding has a bright future unless the identification success rates can be increased considerably. Incidentally, our best identification rate of 68% is in line with the rates reported in Will and Rubinoff (2004) and decades of research on intraspecific sequence variability in a wide range of taxa (summarized in Funk and Omland, 2003). Funk and Omland (2003) found that 23% of the 2319 assayed species did not form monophyletic groups and that in two-thirds of these cases, the "polyphyly" was supported by bootstrap values above 70%. For arthropods, the rate was even higher (26.5% of 702 spp. surveyed). Furthermore, Moritz and Cicero (2004) found that 74% of 29 surveyed sister species pairs of birds would not be recognized as species if Hebert et al.'s (2003b) rule for delimiting species was applied; i.e., our results appear more in line with Will and Rubinoff's (2004), Funk and Omland's (2003), and Moritz and Cicero's (2004) assessment than the empirical studies that have been published by proponents of DNA barcoding.
The reader of the DNA barcoding literature will be surprised by our low success rates, which contrast sharply with the perfect or near perfect success rates reported elsewhere (Barrett and Hebert, 2005; Hebert and Gregory, 2005; Hebert et al., 2003a, 2004b; Hogg and Hebert, 2004). We believe that the main reason for this discrepancy is study design. Previously, two different kinds of tests of barcoding had been performed. The first was based on extensive compilations of intra- and interspecific COI distances across a wide range of taxa (Barrett and Hebert, 2005; Floyd et al., 2002; Hebert et al., 2003a; Hogg and Hebert, 2004) and revealed that the average distances between congeneric species are relatively large. However, average distances between species are of little relevance when it comes to identifying closely related species. This is familiar to all taxonomists who know that it may be easy to separate a species from the "average" species in its genus but that it can still be very challenging to distinguish sister species. Contrary to suggestions in the literature (e.g., Barrett and Hebert, 2005; Hebert et al., 2003a; Hogg and Hebert, 2004), average similarities to other species in a genus are irrelevant for predicting identification success for closely related species.
The second kind of empirical test of DNA barcoding was conducted based on specimens collected at one or a few sites. The problem with these studies is the lack of consideration for geographic variation (Moritz and Cicero, 2004; Prendini, 2005; Sperling, 2003; Will and Rubinoff, 2004; Ward et al., 2005), although it is well known that species with a wide geographic distribution often contain a considerable amount of genetic variability. By only sampling individuals from a single locality, this variability is not considered and the distinctness of species barcodes is easily overestimated. Sparse taxon sampling at the species level also makes it likely that mostly distantly related species are included in the test, thus increasing the chance of finding distinct species-specific sequence differences (e.g., Will and Rubinoff, 2004). We believe that the dense taxon sampling in our data set accounts for some of the identification problems encountered in our study, because some genera and/or species groups had clearly been sampled very extensively by experts carrying out phylogeographic studies within closely related species and/or populations (Appendix 1; available online at http://systematicbiology.org).
Techniques for identifying sequences to species using DNA barcoding are still in their infancy, but new methods and algorithms are rapidly appearing (e.g., Blaxter et al., 2005; Nilsson et al., 2004; Steinke et al., 2005). Given the problems with tie-trees and the problematic assumption of species monophyly, we do not believe that trees are a promising tool (see also DeSalle et al., 2005; Pozhitkov and Tautz, 2002; Steinke et al., 2005). This is particularly so because each tree has many nested clades and it is thus impossible to decide based on tree topology alone which node on the tree delimits a population, species, or supraspecific taxon (see our Homo sapiens–great ape example in the introduction). Instead, some kind of similarity criterion has to be used in conjunction with the tree before one can decide whether a particular query is conspecific or allospecific to its closest match. Given that a similarity metric will also be needed for tree-based identification, we would argue that one may as well abandon the tree-based approaches and instead directly use the similarity metric as has been implemented in several recent algorithms (e.g., Blaxter et al., 2005; Steinke et al., 2005). We furthermore find that for our data set, similarity-based techniques outperform tree-based identification. For example, the success rate of tree-based identification is below 50% (Table 4), whereas best close match yields a much higher success rate (67%) and has a relatively low incidence of misidentification (6%; Table 3). Therefore, if we had to use DNA sequences for species identification, we would prefer this technique, although a large proportion of the queries remain unidentified because they lack close matches in the barcode database. We nevertheless consider this result preferable over using an identification technique like best match that yields marginally higher success rates (68%) but also a large number of misidentifications (27%). 
All species barcodes is the most conservative identification technique, but given its low success rate of 39% it will probably only be justifiable for forensic purposes.
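The difference between best match and best close match can be made concrete with a minimal sketch. It assumes aligned barcodes of equal length, uncorrected distances, and a user-supplied distance threshold; the function names, the toy library, and the 3% threshold in the usage lines are illustrative assumptions rather than the settings used for the analyses reported here. The all species barcodes criterion is not covered.

```python
def p_distance(a, b):
    """Uncorrected distance between two aligned sequences of equal length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def closest(query, library):
    """Return (smallest distance, set of species names found at that distance)."""
    scored = [(p_distance(query, seq), species) for species, seq in library]
    best = min(d for d, _ in scored)
    return best, {species for d, species in scored if d == best}

def best_match(query, library):
    """Name the query after its closest barcode, however distant that barcode is."""
    _, names = closest(query, library)
    return next(iter(names)) if len(names) == 1 else "ambiguous"

def best_close_match(query, library, threshold):
    """As best_match, but refuse to identify when the closest barcode is too distant."""
    best, names = closest(query, library)
    if best > threshold:
        return "unidentified"
    return next(iter(names)) if len(names) == 1 else "ambiguous"

# Hypothetical barcode library: (species, aligned sequence).
library = [
    ("sp_a", "AAAAAAAAAA"),
    ("sp_b", "AAAAAAAATT"),
    ("sp_c", "CCCCCCCCCC"),
]

# A query from a species missing from the library: best match still returns
# a name, whereas best close match leaves the query unidentified.
print(best_match("CCCCCGGGGG", library))
print(best_close_match("CCCCCGGGGG", library, threshold=0.03))
```

The sketch shows why the two criteria trade misidentifications against unidentified queries: best match always returns the nearest name, so a query from an unsampled species is necessarily misidentified, whereas the threshold converts such cases into non-identifications.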
Is a Complete Barcode Database Obtainable?
Best match would perform much better if it was applied to a data set from which single-sequence species have been removed. However, in a real-life situation it is impossible to know whether a new query is coming from a "new" species or from a species that is already represented in the data set (Scotland et al., 2003; Seberg, 2004; Will and Rubinoff, 2004). Of course, the other solution would be using a database that has barcodes for all species on the planet. Creating such a database may alleviate some of the problems with DNA barcoding and is thus the goal of the Barcoding Life Consortium. However, we agree with many systematists that this goal is completely unrealistic (Marshall, 2003). The existing strategies for creating a complete barcode database mostly rely on tissues from existing cryo-collections and museum specimens. However, the former will only yield barcodes for relatively few, mostly common, and/or widespread species. The proposed use of tissues from identified specimens in natural history museums is equally inadequate and problematic.
First, museums can at best provide identified specimens for described species; for poorly known groups this is a minute fraction of the actual diversity. It is in this context revealing that current work on barcoding concentrates on taxonomically well-known groups such as birds, fish, and Lepidoptera (Hebert and Gregory, 2005; Hebert et al., 2003a, 2004b). However, if barcoding is restricted to these kinds of taxa, then the very groups that are most in need of new taxonomic techniques will be left out. For example, only 50,000 of the estimated 1 million species of nematodes have been described (see Seberg, 2004) and recent revisions of invertebrates routinely double species counts (Ponder and Lunney, 1999). This indicates that prerevision collections contain at best identified material for 50% of all species. Species richness estimation furthermore clearly indicates that many additional species remain uncollected and are thus not even represented in the collections (Marshall, 2003; Seberg, 2004; Meier and Dikow, 2004). All these species will have to be formally described before a reference sequence can be deposited in a barcode database. Barcoding can thus never be faster than traditional taxonomy. We would even argue that barcoding is currently exacerbating the taxonomic crisis by submitting large numbers of poorly identified sequences for putatively cryptic species to GenBank. Of the 5426 sequences with the keyword "barcode" that had been submitted by February 2006, 2459 (45%) were only identified to genus. At the same time, Hebert and Gregory (2005) explicitly state that DNA sequences should not be used for describing new species, so these sequences will probably remain unidentified for the foreseeable future.
Second, much of the identified material in museums is unsuitable and/or too valuable for molecular work (Quicke, 2004; Will and Rubinoff, 2004). A good example is the approximately 40% of known beetle species that have only been collected at a single locality (see Seberg, 2004). Most specimens in all likelihood went through the conventional entomological pinning procedure, which involves softening in a moist chamber for several days. Such treatment will degrade much of the DNA and probably explains the generally low gene-amplification success for DNA extracted from pinned insect specimens (e.g., 31% for "archival moths": Hajibabaei et al., 2005). The material for other groups is even less accessible. For example, nematodes, mites, and fish were in the past mostly preserved in formaldehyde and/or slide-mounted and especially many nematode specimens have been lost (De Ley et al., 2005).
Third, even if museum specimens are well preserved and available for DNA extraction, they are often not a good source for generating barcodes because a large proportion are misidentified (see examples in Meier and Dikow, 2004; Ward et al., 2005). In conclusion, no strategy is in sight that will yield an even approximately complete barcode database for those groups that urgently require new identification techniques. However, our study and one other recent empirical study (Meyer and Paulay, 2005) document that barcoding will yield low identification success if the barcode database is incomplete.
DNA Taxonomy
Supporters of DNA taxonomy have suggested that a functional taxonomy can be based wholly on DNA sequences. The most popular approach involves the use of fixed sequence-divergence values for delimiting species (e.g., 3%; Barrett and Hebert, 2005; Hebert et al., 2003a). One problem with this approach is that the choice of threshold value is arbitrary and renders species borders a matter of opinion (Ferguson, 2002; Prendini, 2005; Will and Rubinoff, 2004). However, an equally serious and more fundamental theoretical challenge is that it is logically impossible to maintain such threshold values (Fig. 1). We find numerous cases in Diptera where two out of the three pairwise distances for three sequences remain below a given threshold while the third distance exceeds the threshold (Fig. 1). For example, for 3% sequence clusters, 13% of the 106 DNA profiles have pairwise distances in excess of the threshold. The largest observed distance in a 3% cluster is 4.8%, which implies that even if a query sequence differs by 4.5% from another sequence the query sequence may still belong to the same 3% DNA species (Table 5); i.e., species based on thresholds only lead to a convenient taxonomic system as long as few sequences are known. As taxon sampling improves, the maze of pairwise distances becomes impenetrable and convenient fixed values are unenforceable unless one is willing to accept that sequence input order influences the composition of the sequence clusters (see Blaxter et al., 2005). Note that these results are largely independent of which pairwise distance is used and whether the sequence overlap is small (300 bp) or large (600 bp; Fig. 4).
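The logical problem with fixed thresholds can be illustrated with a minimal sketch. It assumes that clusters are formed by joining any two sequences whose distance does not exceed the threshold (single linkage), which is one way to assemble such profiles and not necessarily the procedure used for Table 5; the sequence labels and distances below are hypothetical.

```python
from itertools import combinations

def inconsistent_triplets(names, dist, threshold):
    """Triplets in which exactly two of the three pairwise distances are within the threshold."""
    found = []
    for a, b, c in combinations(names, 3):
        d = [dist[frozenset(pair)] for pair in ((a, b), (a, c), (b, c))]
        if sum(x <= threshold for x in d) == 2:
            found.append((a, b, c))
    return found

def threshold_clusters(names, dist, threshold):
    """Single-linkage clusters: sequences are joined whenever a pair is within the threshold."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())

def max_within(cluster, dist):
    """Largest pairwise distance inside one cluster (0.0 for a single sequence)."""
    return max((dist[frozenset(p)] for p in combinations(cluster, 2)), default=0.0)

# Hypothetical distances: s1-s2 and s2-s3 are within 3%, s1-s3 is not.
names = ["s1", "s2", "s3"]
dist = {
    frozenset(("s1", "s2")): 0.020,
    frozenset(("s2", "s3")): 0.025,
    frozenset(("s1", "s3")): 0.045,
}
clusters = threshold_clusters(names, dist, threshold=0.03)
print(inconsistent_triplets(names, dist, threshold=0.03))
print(clusters, max_within(clusters[0], dist))
```

In this toy case all three sequences end up in one 3% cluster although s1 and s3 differ by 4.5%. The only alternative is to split the triplet, and any split separates a pair of sequences that lie within the threshold of each other.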
Furthermore, for the Diptera data set, the majority (53%) of the 3% profiles contradict traditionally recognized species with some having sequences in six different profiles (Table 5). Instead of speeding up taxonomic work, a sequence-based taxonomy would thus have to start with the redescription of most species. This will increase the taxonomic impediment and further slow taxonomic progress. Because new sequences can change the pairwise-distance profiles within DNA species by fusing two clusters, we also dispute that a DNA taxonomy would be more stable than traditional taxonomy as has been claimed. We tested multiple threshold values, but all lead to major revisions of existing species-level classifications and all would cause complete taxonomic chaos in Diptera (Table 5). Given these problems it is surprising that distance-based approaches to taxonomy remain so popular (Blaxter, 2003; Chu et al., 1999, 2003; Shih et al., 2004; Sun et al., 2003; Tang et al., 2000, 2003; Zhi et al., 1996).
Proponents of a DNA-based taxonomy may argue that species paraphyly, large intraspecific variability, and small interspecific variability are reasons to revise the borders of traditionally recognized species (Barrett and Hebert, 2005; Blaxter, 2003; Hebert et al., 2003a, 2004b; Hogg and Hebert, 2004; Tautz et al., 2003; Zhi et al., 1996). However, there are compelling reasons why species boundaries cannot be mechanically modified in response to genetic distances (Ferguson, 2002; Lee, 2004). The genes used in molecular taxonomy vary predominantly in selectively largely neutral nucleotide positions and therefore distances among closely related sequences reflect to a large extent time of divergence; i.e., if we were to break up species with large intraspecific variability, we would effectively deny that species can be ancient. Similarly, given standard rates of COI evolution (Pestano et al., 2003), lumping species with low interspecific variability is equivalent to refusing species status to species younger than 1.5 million years regardless of reproductive isolation and morphological distinctiveness. Mechanically modifying species borders in response to small distances also ignores all research indicating that reproductive isolation can evolve rapidly via sexual selection (e.g., Ferguson, 2002; Lee, 2004; Moritz and Cicero, 2004) and that mitochondrial genes can be misleading due to phenomena such as lineage sorting, male-biased gene flow, introgression following hybridization, and numts (e.g., Moritz and Cicero, 2004). One should also remember that in contrast to the characters used in traditional taxonomy (e.g., genitalia, mating calls, coloration; Lee, 2004), the genes used in molecular taxonomy are not functionally correlated with speciation (Ferguson, 2002; Lee, 2004).
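The age limit of 1.5 million years can be reproduced with a back-of-the-envelope calculation. The sketch assumes the 3% distance threshold discussed above and a pairwise divergence rate of roughly 2% per million years; the rate is an illustrative assumption and is not a value stated in the text, so the result should be read as an order-of-magnitude check only.

```latex
t \;\approx\; \frac{d}{r}
  \;=\; \frac{3\%}{2\%\ \text{per Myr}}
  \;=\; 1.5\ \text{Myr}
```

Here d is the pairwise distance threshold, r the assumed pairwise divergence rate, and t the corresponding time since divergence. A faster assumed rate lowers the age limit and a slower rate raises it.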
Conclusions
Our empirical test of DNA taxonomy reveals that an identification system exclusively relying on COI for Diptera successfully determines less than 70% of all sequences. Misidentification rates are low, but for a significant number of species the sequences are not divergent enough or too divergent to allow identification. Given the prevalence of violations of pairwise-distance thresholds, an alpha-taxonomy based on DNA sequences is similarly problematic. However, it is important to stress that these results do not question the value of DNA sequences in taxonomy. DNA sequences are already invaluable for many purposes and should be extensively used where needed (Armstrong and Ball, 2005; Birstein et al., 1998; Hebert et al., 2003a; Palumbi and Cipriano, 1998; Paquin and Hedin, 2004; Tautz et al., 2003; Will et al., 2005). There is also evidence that DNA barcoding can be successful in some taxa and/or at a regional scale. For example, we find that Aedes and Anopheles mosquitoes identify relatively well based on COI and expect the same for species samples lacking large numbers of closely related species (Armstrong and Ball, 2005; Moritz and Cicero, 2004; Sperling, 2003; Stoeckle, 2003; Summerbell et al., 2005). DNA sequences also allow for the identification of genetic diversity and unusual patterns of genetic variability that require further study (e.g., Hebert and Gregory, 2005; Paquin and Hedin, 2004; Quicke, 2004). Over the past 250 years the character basis for describing and identifying species has continually broadened, and DNA sequences are a very valuable new tool. Sequences certainly improve our ability to recognize and describe species (DeSalle et al., 2005; Paquin and Hedin, 2004; Sperling, 2003; Will et al., 2005). For some groups of organisms and life history stages, the old tools have been struggling and here DNA sequences will turn out to be the character source of choice (e.g., Armstrong and Ball, 2005; Paquin and Hedin, 2004). 
For other clades, DNA sequences will only play a minor role. Barcoding all of life may be a bold proposal that has recently been skilfully promoted (Sperling, 2003), but it is really quite unnecessary for taxa like birds (Dunn, 2003) and can be misleading in other taxa like Diptera. Systematists have not failed to describe all species because of a lack of effort, but rather, because of the overwhelming species diversity in combination with the complexities involved in recognizing closely related species. Simplistic proposals based on DNA sequences only create pseudosolutions to real biological problems and what is really needed is an integrative approach to taxonomy embracing all available evidence (DeSalle et al., 2005; Paquin and Hedin, 2004; Will et al., 2005).
Acknowledgments
We gratefully acknowledge comments on the manuscript by numerous colleagues, including two anonymous reviewers, Roderic Page, and Marshal Hedin. Financial support came from the Academic Research Fund grants R-154-000-256-112 and R154-000-270-112 from the Ministry of Education, Singapore.
References
Armstrong K. F., Ball S. L. DNA barcodes for biosecurity: Invasive species identification. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005) 360:1813–1823.
Avise J. C. Phylogeography: The history and formation of species (2000) Cambridge, Massachusetts: Harvard University Press.
Avise J. C., Walker D. Species realities and numbers in sexual vertebrates: Perspectives from an asexually transmitted genome. Proc. Natl. Acad. Sci. USA (1999) 96:992–995.
Backeljau T., De Bruyn L., De Wolf H., Jordaens K., Van Dongen S., Winnepennincks B. Multiple UPGMA and neighbor-joining trees and the performance of some computer packages. Mol. Biol. Evol. (1996) 13:309–313.
Baker C. S., Dalebout M. L., Lavery S., Ross H. A. www.DNA-surveillance: Applied molecular taxonomy for species conservation and discovery. TREE (2003) 18:271–272.
Barrett R. D. H., Hebert P. D. N. Identifying spiders through DNA barcodes. Can. J. Zool. (2005) 83:481–491.
Birstein V. J., Doukakis P., Sorkin B., Desalle R. Population aggregation analysis of three caviar-producing species of sturgeons and implications for the species identification of black caviar. Cons. Biol. (1998) 12:766–775.
Blaxter M. Counting angels with DNA. Nature (2003) 421:122–124.
Blaxter M., Floyd R. Molecular taxonomics for biodiversity surveys: Already a reality. TREE (2003) 18:268–269.
Blaxter M., Mann J., Chapman T., Thomas F., Whitton C., Floyd R., Abebe E. Defining operational taxonomic units using DNA barcode data. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005) 360:1935–1943.
Bond J. E. Systematics of the Californian euctenizine spider genus Apomastus (Araneae: Mygalomorphae: Cyrtaucheniidae): The relationship between molecular and morphological taxonomy. Invert. Syst. (2004) 18:361–376.
Bond J. E., Sierwald P. Molecular taxonomy of the Anadenobolus excisus (Diplopoda: Spirobolida: Rhinocricidae) species-group on the Caribbean island of Jamaica. Invert. Syst. (2003) 17:515–528.
Chu K. H., Ho H. Y., Li C. P., Chan T. Y. Molecular phylogenetics of the mitten crab species in Eriocheir, sensu lato (Brachyura: Grapsidae). J. Crust. Zool. (2003) 23:738–746.
Chu K. H., Tong J., Chan T. Y. Mitochondrial cytochrome oxidase I sequence divergence in some Chinese species of Charybdis (Crustacea: Decapoda: Portunidae). Biochem. Syst. Ecol. (1999) 27:461–468.
Crisp M. D., Chandler G. T. Paraphyletic species. Telopea (1996) 6:813–844.
De Ley P., De Ley I. T., Morris K., Abebe E., Mundo-Ocampo M., Yoder M., Heras J., Waumann D., Rocha-Olivares A., Burr A. H. J., Baldwin J. G., Thomas W. K. An integrated approach to fast and informative morphological vouchering of nematodes for applications in molecular barcoding. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005) 360:1945–1958.
DeSalle R., Egan M. G., Siddall M. The unholy trinity: Taxonomy, species delimitation and DNA barcoding. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005) 360:1905–1916.
Dunn C. P. Keeping taxonomy based in morphology. TREE (2003) 18:270–271.
Ebach M. C., Holdrege C. DNA barcoding is no substitute for taxonomy. Nature (2005) 434:697.
Ferguson J. W. H. On the use of genetic divergence for identifying species. Biol. J. Linn. Soc. (2002) 75:509–516.
Floyd R., Abebe E., Papert A., Blaxter M. Molecular barcodes for soil nematode identification. Mol. Ecol. (2002) 11:839–850.
Frost D. R., Crafts H. M., Fitzgerald L. A., Titus T. A. Geographic variation, species recognition, and molecular evolution of cytochrome oxidase I in the Tropidurus spinulosus complex (Iguania: Tropiduridae). Copeia (1998) 1998:839–851.
Funk D. J., Omland K. E. Species-level paraphyly and polyphyly: Frequency, causes, and consequences, with insights from animal mitochondrial DNA. Ann. Rev. Ecol. Syst. (2003) 34:397–423.
Gibbs M. J., Armstrong J. S., Gibbs A. J. Individual sequences in large sets of gene sequences may be distinguished efficiently by combinations of shared sub-sequences. BMC Bioinformatics (2005) 6.
Godfray H. C. J. Challenges for taxonomy. Nature (2002) 417:17–19.
Godfray H. C. J., Knapp S. Taxonomy for the twenty-first century—Introduction. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2004) 359:559–569.
Goloboff P., Farris J. S., Nixon K. T.N.T.: Tree analysis using new technology. Program and documentation (2003) available from the authors, and at www.zmuc.dk/public/phylogeny.
Hajibabaei M., DeWaard J. R., Ivanova N. V., Ratnasingham S., Dooh R. T., Kirk S. L., Mackie P. M., Hebert P. D. N. Critical factors for assembling a high volume of DNA barcodes. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005) 360:1959–1967.
Harris D. J. Can you bank on GenBank? TREE (2003) 18:317–319.
Hebert P. D. N., Barrett R. D. H. Reply to the comment by L. Prendini on "Identifying spiders through DNA barcodes". Can. J. Zool. (2005) 83:481–491.
Hebert P. D. N., Cywinska A., Ball S. L., deWaard J. R. Biological identifications through DNA barcodes. Proc. Roy. Soc. Ser. B (2003a) 270:313–321.
Hebert P. D. N., Gregory T. R. The promise of DNA barcoding for taxonomy. Syst. Biol. (2005) 54:852–859.
Hebert P. D. N., Ratnasingham S., deWaard J. R. Barcoding animal life: Cytochrome c oxidase subunit 1 divergences among closely related species. Proc. Roy. Soc. Ser. B (2003b) 270:S96–S99.
Hebert P. D. N., Ratnasingham S., Dooh R. Barcodes of Life (2004a) http://www.barcodinglife.com/.
Hebert P. D. N., Stoeckle M., Zemlak T. S., Francis C. M. Identification of birds through DNA barcodes. PLoS Biol. (2004b) 2:1657–1663.
Hogg I. D., Hebert P. D. N. Biological identification of springtails (Hexapoda: Collembola) from the Canadian Arctic, using mitochondrial DNA barcodes. Can. J. Zool. (2004) 82:749–754.
Janzen D. Now is the time. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2004) 359:731–732.
Johns G. C., Avise J. C. A comparative summary of genetic distances in the vertebrates from the mitochondrial cytochrome b gene. Mol. Biol. Evol. (1998) 15:1481–1490.
Klaus A. V., Kulasekera V. L., Schawaroch V. Three-dimensional visualization of insect morphology using confocal laser scanning microscopy. J. Microsc. (2003) 212:107–121.
Lee M. S. Y. The molecularisation of taxonomy. Inv. Syst. (2004) 18:1–6.
Lipscomb D., Platnick N., Wheeler Q. The intellectual content of taxonomy: A comment on DNA taxonomy. TREE (2003) 18:65–66.
Marshall E. Taxonomy: Will DNA Bar Codes Breathe Life Into Classification? Science (2005) 307:1037.
Marshall S. The real costs of insect identification [opinion page]. Newsl. Biol. Surv. Can. (Terrestrial Arthropods) (2003) 22:15–18.
Meier R., Dikow T. Significance of specimen databases from taxonomic revisions for estimating and mapping the global species diversity of invertebrates and repatriating reliable and complete specimen data. Cons. Biol. (2004) 18:478–488.
Meyer C. P., Paulay G. DNA barcoding: Error rates based on comprehensive sampling. PLoS Biol. (2005) 3:2229–2238.
Moritz C., Cicero C. DNA barcoding: Promise and pitfalls. PLoS Biol. (2004) 2:1529–1531.
Nilsson R. H., Larsson K. H., Ursing B. M. Galaxie—CGI scripts for sequence identification through automated phylogenetic analysis. Bioinformatics (2004) 20:1447–1452.
Nylander J. A. A. MrModelTest, version 2 (2004) Evolutionary Biology Centre, Uppsala University.
Page R. D. M., Holmes E. C. Molecular evolution: A phylogenetic approach (1998) Oxford, UK: Blackwell Science.
Palumbi S. R., Cipriano F. Species identification using genetic tools: The value of nuclear and mitochondrial gene sequences in whale conservation. J. Hered. (1998) 89:459–464.
Panchen A. L. Classification, evolution, and the nature of biology (1992) Cambridge, UK: Cambridge University Press.
Paquin P., Hedin M. M. The power and perils of "molecular taxonomy": A case study of eyeless and endangered Cicurina (Araneae: Dictynidae) from Texas caves. Mol. Ecol. (2004) 13:3239–3255.
Pons J., Vogler A. P. Complex pattern of coalescence and fast evolution of a mitochondrial rRNA pseudogene in a recent radiation of tiger beetles. Mol. Biol. Evol. (2005) 22:991–1000.
Pennisi E. Modernizing the tree of life. Science (2003) 300:1692–1697.
Pestano J., Brown R. P., Suarez N. M., Baez M. Diversification of sympatric Sapromyza (Diptera: Lauxaniidae) from Madeira: Six morphological species but only four mtDNA lineages. Mol. Phylogenet. Evol. (2003) 27:422–428.
Ponder W., Lunney D. The other 99%—The conservation and biodiversity of invertebrates. Trans. Roy. Zool. Soc. NSW. (1999).
Pozhitkov A. E., Tautz D. An algorithm and program for finding sequence specific oligonucleotide probes for species identification. BMC Bioinformatics (2002) 3.
Prendini L. Comment on "Identifying spiders through DNA barcodes". Can. J. Zool. (2005) 83:498–504.
Quicke D. L. J. The world of DNA barcoding and morphology—Collision or synergism and what of the future? Systematist (2004) 23:8–12.
Ronquist F., Huelsenbeck J. P. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.
Ruedas L. A., Salazar-Bravo J., Dragoo J. W., Yates T. L. The importance of being earnest: What, if anything, constitutes a specimen examined? Mol. Phylogenet. Evol. (2000) 17:129–132.
Scotland R., Hughes C., Donovan B., Wortley A. The Big Machine and the much-maligned taxonomist. Syst. and Biodiv. (2003) 1:139–143.
Seberg O. The future of systematics: Assembling the Tree of Life. The Systematist (2004) 23:2–8.
Seberg O., Humphries C. J., Knapp S., Stevenson D. W., Petersen G., Scharff N., Andersen N. M. Shortcuts in systematics? A commentary on DNA-based taxonomy. TREE (2003) 18:63–65.
Shih H.-T., Ng P. K. L., Chang H.-W. The systematics of the genus Geothelphusa (Crustacea, Decapoda, Brachyura, Potamidae) from southern Taiwan: A molecular appraisal. Zool. Stud. (2004) 43:561–570.
Sperling F. DNA barcoding. Deus et machina [opinion page]. Newsl. Biol. Surv. Can. (Terrestrial Arthropods) (2003) 22:50–53.
Stahls G., Stuke J.-H., Vujic A., Doczkal D., Muona J. Phylogenetic relationships of the genus Cheilosia and the tribe Rhingiini (Diptera, Syrphidae) based on morphological and molecular characters. Cladistics (2004) 20:105–122.
Steinke D., Vences M., Salzburger W., Meyer A. TaxI: A software tool for DNA barcoding using distance methods. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005) 360:1975–1980.
Stoeckle M. Taxonomy, DNA, and the Bar Code of Life. BioScience (2003) 53:796–797.
Summerbell R. C., Levesque C. A., Seifert K. A., Bovers M., Fell J. W., Diaz M. R., Boekhout T., de Hoog G. S., Stalpers J., Crous P. W. Microcoding: the second step in DNA barcoding. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005) 360:1897–1903.
Sun H. Y., Zhou K., Yang X. J. Phylogenetic relationships of the mitten crabs inferred from mitochondrial 16S rDNA partial sequences (Crustacean, Decapoda). Acta Zool. Sin. (2003) 49:592–599.
Swofford D. L. PAUP*: Phylogenetic analysis using parsimony (*and other methods), version 4.0b10 (2002) Sunderland, Massachusetts: Sinauer Associates.
Takezaki N. Tie trees generated by distance methods of phylogenetic reconstruction. Mol. Biol. Evol. (1998) 15:727–737.
Tang B., Zhou K., Song D., Yang G., Dai A. Molecular systematics of the Asian mitten crabs, genus Eriocheir (Crustacea: Brachyura). Mol. Phylogenet. Evol. (2003) 29:309–316.
Tautz D., Arctander P., Minelli A., Thomas R. H., Vogler A. P. DNA points the way ahead in taxonomy. Nature (2002) 418:479.
Tautz D., Arctander P., Minelli A., Thomas R. H., Vogler A. P. A plea for DNA taxonomy. TREE (2003) 18:70–74.
Thacker P. D. Morphology: The shape of things to come. BioScience (2003) 53:544–549.
Thompson F. C. Biosystematic Database of World Diptera (2005) http://www.sel.barc.usda.gov/Diptera/biosys.htm.
Tong J. G., Chan T. Y., Chu K. H. A preliminary phylogenetic analysis of Metapenaeopsis (Decapoda: Penaeidae) based on mitochondrial DNA sequences of selected species from the Indo West Pacific. J. Crust. Biol. (2000) 20:541–549.
Vilgalys R. Taxonomic misidentification in public DNA databases. New Phytol. (2003) 160:4–5.
Ward R. D., Zemlak T. S., Innes B. H., Last P. R., Hebert P. D. N. DNA barcoding Australia's fish species. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2005) 360:1847–1857.
Wheeler Q. D. Taxonomic triage and the poverty of phylogeny. Philos. Trans. R. Soc. Lond. B Biol. Sci. (2004) 359:571–583.
Wheeler Q. D., Meier R. Species concepts and phylogenetic theory. A debate (2000) New York: Columbia University Press.
Will K. W., Mishler B. D., Wheeler Q. D. The perils of DNA barcoding and the need for integrative taxonomy. Syst. Biol. (2005) 54:844–851.
Will K. W., Rubinoff D. Myth of the molecule: DNA barcodes for species cannot replace morphology for identification and classification. Cladistics (2004) 20:47–55.
Wilson E. O. The encyclopedia of life. TREE (2003) 18:77–80.
Winston J. E. Describing species: Practical taxonomic procedure for biologists (1999) New York: Columbia University Press.
Zhi L., Karesh W. B., Janczewski D. N., Frazier-Taylor H., Sajuthi D., Gombek F., Andau M., Martenson J. S., O'Brien S. J. Genomic differentiation among natural populations of orang-utan (Pongo pygmaeus). Curr. Biol. (1996) 6:1326–1336.