
Systematic Biology 2008 57(5):809-813; doi:10.1080/10635150802406343

© 2008 Society of Systematic Biologists

The Use of Mean Instead of Smallest Interspecific Distances Exaggerates the Size of the "Barcoding Gap" and Leads to Misidentification

Edited by Kelly Zamudio

Rudolf Meier1, Guanyang Zhang1,2 and Farhan Ali1

1 Department of Biological Sciences, University Scholars Programme, National University of Singapore, Science Drive 4, Singapore 117543, Singapore; E-mail: dbsmr{at}nus.edu.sg (R.M.)
2 Department of Entomology, University of California Riverside, Riverside, California 92521, USA

Received December 8, 2007; Revised March 21, 2008; Accepted May 7, 2008

DNA barcoding is one of the best funded and most visible large-scale initiatives in systematic biology and has generated both much interest and controversy. DNA barcoding has also attracted significant support from foundations that had previously shown little interest in systematics. Yet, the project is controversial because many systematists feel that currently the conceptual foundation of DNA barcoding remains weak. This problem can only be alleviated through additional research that can lead to improved tools and concepts. Here, we scrutinize a key concept of DNA barcoding, the so-called barcoding gap (Meyer and Paulay, 2005), and use empirical data to document that it needs to be computed based on the smallest instead of the mean interspecific distances.

In the literature on DNA barcoding, the "barcoding gap" (Meyer and Paulay, 2005) refers to the separation between mean intra- and interspecific sequence variability for congeneric COI sequences. The barcoding gap is so essential to barcoding that a widely cited publication was dedicated to documenting these gaps across major metazoan taxa (Hebert et al., 2003b). It is also regularly mentioned in articles promoting barcoding to a broader audience (Check, 2005; Cognato and Caesar, 2006; Dasmahapatra and Mallet, 2006) and is one of the few metrics included in the Web-based identification system BOLD, "The Barcode of Life Data System," which is a major identification tool for the DNA barcoding community (http://www.barcodinglife.org; Ratnasingham and Hebert, 2007). Large barcoding gaps are routinely used to predict DNA-barcoding success for the taxon under study (Hebert et al., 2003a, 2003b, 2004a, 2004b; Hogg and Hebert, 2004; Powers, 2004; Zehner et al., 2004; Armstrong and Ball, 2005; Ball et al., 2005; Barrett and Hebert, 2005; Lorenz et al., 2005; Saunders, 2005; Smith et al., 2005, 2006; Ward et al., 2005; Cywinska et al., 2006; Hajibabaei et al., 2006a, 2006b; Lefebure et al., 2006; Clare et al., 2007; Seifert et al., 2007). However, here we argue and document that barcoding gaps are currently incorrectly computed and that the values reported in the barcoding literature are misleading. The main problem is that the barcoding gap is generally quantified as the difference between intraspecific and mean interspecific, congeneric distances, whereas we will argue here that for species identification only the smallest interspecific distance should be used. 
Other authors have also pointed out that the use of smallest interspecific distances would be more appropriate (see Sperling, 2003; Moritz and Cicero, 2004; Vences et al., 2005a, 2005b; Cognato, 2006; Meier et al., 2006; Meyer and Paulay, 2005; Roe and Sperling, 2007), but currently we lack a comparative study that documents that the two measures yield different results. Here we provide evidence based on 43,137 COI sequences from 12,459 Metazoan species that barcoding gaps based on mean interspecific distances are artificially inflated and that only smallest interspecific distances correctly reflect that species identification gets more difficult as more species are sampled.

Using DNA barcodes for species identification is analogous to identifying an unidentified specimen by comparing it to a reference collection of identified specimens. Initially one may compare an unidentified specimen to all identified material in the same genus, but ultimately the identification problem pares down to deciding whether a specimen belongs to one of a few, very similar, congeneric species. Determining an unidentified specimen to species is straightforward if the intraspecific variability is small (i.e., the unidentified specimen is a good match to a referenced species) and the difference between the best-matching species and the next best match is large (i.e., the specimen is a good match to only one of the referenced species). Analogously, the ease with which a query sequence can be identified to species depends only on how different it is from the most similar allospecific sequence, whereas its distinctness from a hypothetical "average" congeneric species does not matter (see Sperling, 2003; Moritz and Cicero, 2004; Vences et al., 2005a, 2005b; Cognato, 2006; Meier et al., 2006; Meyer and Paulay, 2005; Roe and Sperling, 2007). Yet, DNA barcoding publications and BOLD continue to report the mean instead of the smallest interspecific distances for congeneric species.

In order to quantify the interspecific distances we aligned 43,137 GenBank sequences for 12,459 species of Metazoa (see Table 1 and online Supplementary Material 1, available at http://www.systematicbiology.org) based on amino acid translations using Alignment-Helper (McClellan and Woolley, 2004) in conjunction with ClustalW (Thompson et al., 1994). For each sequence for the 4599 species with multiple sequences in the data set, the mean uncorrected, intraspecific distance was collected using TaxonDNA (Meier et al., 2006). We also determined for all sequences the mean and the smallest interspecific distances for congeneric species. We then calculated the overlap between intra- and interspecific variability after deleting the 5% largest intraspecific and the 5% smallest interspecific distances (Meier et al., 2006). For one sample (Coleoptera: 5431 sequences for 1942 species), we also tested whether uncorrected distances and estimates of pairwise sequence divergence under the K2P model yield similar results. The main reason for using the K2P model was its widespread use in the barcoding literature. However, in contrast to the barcoding literature, the pairwise sequence divergences are here expressed as substitutions per site (subs./site).
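The two interspecific measures compared here can be illustrated with a short sketch. It is not the TaxonDNA implementation; the function names, the toy records, and the decision to skip gaps and ambiguity codes are assumptions made for illustration only. The K2P estimate uses the standard Kimura two-parameter formula, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions.

```python
import math
from statistics import mean

PURINES = {"A", "G"}
NUCLEOTIDES = {"A", "C", "G", "T"}


def count_differences(a, b):
    """Return (compared sites, transitions, transversions) for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in NUCLEOTIDES or y not in NUCLEOTIDES:
            continue  # assumption: gaps and ambiguity codes are ignored
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, transitions, transversions


def p_distance(a, b):
    """Uncorrected distance: proportion of compared sites that differ."""
    sites, ts, tv = count_differences(a, b)
    return (ts + tv) / sites


def k2p_distance(a, b):
    """Kimura two-parameter distance in substitutions per site."""
    sites, ts, tv = count_differences(a, b)
    p, q = ts / sites, tv / sites
    # math.log raises ValueError if the sequences are too divergent for K2P
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def congeneric_interspecific(records, dist=p_distance):
    """For each (genus, species, sequence) record, return
    (species, mean distance, smallest distance) to congeneric allospecific sequences."""
    results = []
    for genus, species, seq in records:
        distances = [dist(seq, other)
                     for g, s, other in records
                     if g == genus and s != species]
        if distances:  # sequences without sampled congeners are skipped
            results.append((species, mean(distances), min(distances)))
    return results


# Hypothetical toy data: three species in one made-up genus
records = [
    ("Aus", "Aus alpha", "AAAAAAAAAA"),
    ("Aus", "Aus beta", "AAAAAAAAAG"),
    ("Aus", "Aus gamma", "AAAAACCCCC"),
]
for species, mean_d, min_d in congeneric_interspecific(records):
    print(species, round(mean_d, 3), round(min_d, 3))
```

In the toy data, one distant congener is enough to pull the mean for "Aus alpha" well above the distance to its closest relative, which is the effect quantified for the real data in Table 1.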


Table 1 Barcoding gaps and sequence variability (uncorrected sequence divergence) for Metazoa. Note that due to the reluctance of some researchers to submit identical haplotypes to GenBank, the mean intraspecific distances are likely overestimates.

 
A comparison of the different barcoding gaps reveals that an approach based on mean interspecific distances yields inflated estimates (Table 1). The differences are particularly striking for invertebrates, the group of animals with the largest need for new identification techniques. A typical invertebrate example is the Coleoptera, where the mean interspecific uncorrected distance is 11.2% ± 4.3%, whereas the smallest interspecific distance is 7% ± 5.4% (K2P: 0.125 ± 0.051 versus 0.084 ± 0.059 subs./site). Correspondingly, the overlap between intra- and interspecific variability is also artificially small for mean values. For example, for Coleoptera the overlap zone is only 1.5% to 7.2% based on mean values, whereas it is 0.2% to 7.2% for smallest interspecific uncorrected distances (K2P: 0.016 to 0.076 versus 0.002 to 0.076 subs./site). For the smaller zone based on mean values, 27% of all pairwise congeneric uncorrected distances fall into the interval, whereas it is 40% for the wider zone based on smallest interspecific values (K2P: mean inter: 25%; smallest inter: 37%).
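The overlap zone reported above can be sketched as follows, assuming it runs from the smallest interspecific distance that survives trimming to the largest intraspecific distance that survives trimming. The function names, the integer-percent trimming rule, and the toy distances are illustrative assumptions, not the TaxonDNA code.

```python
def trimmed_overlap(intra, inter, trim_percent=5):
    """Return the (low, high) overlap zone after dropping the largest
    trim_percent of intraspecific and the smallest trim_percent of
    interspecific distances, or None if the two ranges do not overlap."""
    intra_sorted = sorted(intra)
    inter_sorted = sorted(inter)
    drop_intra = len(intra_sorted) * trim_percent // 100
    drop_inter = len(inter_sorted) * trim_percent // 100
    kept_intra = intra_sorted[:len(intra_sorted) - drop_intra]
    kept_inter = inter_sorted[drop_inter:]
    low, high = kept_inter[0], kept_intra[-1]
    return (low, high) if low <= high else None


def fraction_in_zone(distances, zone):
    """Proportion of pairwise distances that fall inside the overlap zone."""
    low, high = zone
    inside = sum(1 for d in distances if low <= d <= high)
    return inside / len(distances)


# Hypothetical toy distances (proportions, not percentages)
intra = [i / 1000 for i in range(20)]         # 0.000 ... 0.019
inter = [(i + 10) / 1000 for i in range(20)]  # 0.010 ... 0.029

zone = trimmed_overlap(intra, inter)
print(zone, fraction_in_zone(intra + inter, zone))
```

Passing mean interspecific distances as `inter` instead of smallest ones raises the lower bound of the zone, which is why the zone based on means looks narrower.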

Qualitatively, the results are similar for all major taxa of Metazoa (Table 1). Quantitatively, it appears that the different ways to compute the barcoding gaps yield more congruent results for vertebrates than for invertebrates, but it is probably premature to discuss additional taxon-specific differences given that despite 20 years of sequencing COI, the taxon coverage remains very poor and uneven across Metazoa. For example, fewer than 2% of all described species for all four megadiverse orders of insects have been sequenced, whereas the coverage is approximately 10% for birds.

These differences in the size of barcode gaps based on mean versus smallest interspecific distances have major implications for identifying query sequences using DNA barcoding. For example, BOLD will identify an unidentified sequence to species when it has a match within 1% of an identified barcode in the database (Ratnasingham and Hebert, 2007). This appears reasonable based on mean interspecific values but the overlap zone based on smallest interspecific distances clearly indicates that a query with a <1% uncorrected distance to a barcode in a database has a fair chance of being interspecific. This is confirmed by an inspection of our data set that reveals that, for example, for beetles 13% of all congeneric species have an allospecific match below this threshold (K2P: 11%; online Supplementary Material 2, http://www.systematicbiology.org). For other Metazoa groups the proportion of species with such a match ranges from 7% to 26% (Table 1; online Supplementary Material 3, http://www.systematicbiology.org). This can lead to misidentifications in BOLD, because BOLD will incorrectly assume that a <1% uncorrected distance between a query and an identified DNA barcode in the database means that they are conspecific. Yet, depending on which group of Metazoa is considered, a <1% uncorrected distance has a 7% to 26% chance of being interspecific and BOLD may thus assign the incorrect species name.
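The proportion of species with an allospecific match below the threshold can be computed from per-species smallest interspecific distances. The sketch below is an illustration only: it does not reproduce BOLD's identification engine, and the function name, the input dictionary, and the species labels are hypothetical.

```python
def species_below_threshold(smallest_inter, threshold=0.01):
    """Given {species: smallest congeneric interspecific distance}, return the
    species with an allospecific match closer than the threshold and their
    share of all species in the input."""
    flagged = sorted(sp for sp, d in smallest_inter.items() if d < threshold)
    return flagged, len(flagged) / len(smallest_inter)


# Hypothetical distances expressed as proportions (0.01 corresponds to 1%)
smallest_inter = {
    "sp1": 0.004,  # closest relative is sp2
    "sp2": 0.004,
    "sp3": 0.080,
    "sp4": 0.120,
}
flagged, share = species_below_threshold(smallest_inter)
print(flagged, share)
```

A query that matches "sp1" within 1% would, under a fixed 1% rule, be just as compatible with "sp2", so a name assigned from the best hit alone may be wrong.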

Proper measures of interspecific distances are not only important for distance-based identification techniques. DNA barcodes with unusually large distances to putatively conspecific sequences are often also used to predict the existence of cryptic species (Hebert et al., 2004a; Hogg and Hebert, 2004; Armstrong and Ball, 2005; Ball et al., 2005; Barrett and Hebert, 2005; Janzen et al., 2005; Lambert et al., 2005; Smith et al., 2005, 2006; Ward et al., 2005; Cywinska et al., 2006; Hajibabaei et al., 2006a; Clare et al., 2007; Seifert et al., 2007). But what constitutes a large distance? The answer is often obtained by consulting the mean interspecific distances for congeneric species. However, a new species is not recognized based on the mean difference to its congeneric species. Instead, if one wanted to predict cryptic species based on distance, one would have to use the smallest interspecific distance.

A hitherto unnoticed drawback of using the mean instead of the smallest interspecific, congeneric distances for quantifying the barcoding gap is that the difference between the two metrics increases with taxon sampling. As a genus is more exhaustively sampled, the observed mean interspecific distances will converge onto the true mean for the genus, whereas a denser taxon sample will generally decrease the smallest observed interspecific distance for a species. This is due to the fact that with denser sampling, each species is more likely to be matched with its closest relative. We tested these predictions based on the 1001 genera in our dataset that are represented by more than two species and indeed the smallest interspecific distances decrease with the number of species sampled (r = –.12, P < .0001), whereas the mean interspecific distances are not correlated with sampling intensity (r = .06, P > .05). Not surprisingly, the difference between the smallest and the mean interspecific distances increases significantly with the number of congeneric species sampled (Fig. 1) and the mean interspecific distances are thus an increasingly poor estimator for the smallest interspecific distances. Yet, it is the decrease in the smallest interspecific distances that correctly reflects that species identification gets more difficult as the number of species that need to be distinguished increases.
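The sampling argument can be made concrete with a small sketch. For a single focal species, the smallest observed interspecific distance can only stay the same or decrease as congeners are added, whereas the mean may move in either direction. The function name and the distances are hypothetical, and the sketch does not reproduce the correlation analysis across the 1001 genera.

```python
from statistics import mean


def running_mean_and_min(distances):
    """distances: interspecific distances from one focal species to its
    congeners, in the order the congeners are added to the sample.
    Returns (number sampled, mean so far, smallest so far) per step."""
    rows = []
    for n in range(1, len(distances) + 1):
        sampled = distances[:n]
        rows.append((n, mean(sampled), min(sampled)))
    return rows


# Hypothetical distances to four congeners, added one at a time
rows = running_mean_and_min([0.12, 0.09, 0.15, 0.03])
for n, mean_d, min_d in rows:
    print(n, round(mean_d, 4), min_d)
```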


Figure 1 The number of sampled congeneric species is significantly correlated with the difference between mean and smallest interspecific distances (r = .58, P < .00001).

 
We would like to conclude our point of view by properly acknowledging the drawbacks of using GenBank data (see e.g., Meier et al., 2006) and the well-justified criticism of distance-based methods for species determination. Especially with regard to the use of distances in taxonomy and systematics, one could argue that there is no longer a need for additional discussion given the extensive criticism in the literature (Ferguson, 2002; Lee, 2004; Moritz and Cicero, 2004; Will and Rubinoff, 2004; DeSalle et al., 2005; Mallet et al., 2005; Meyer and Paulay, 2005; Prendini, 2005; Cognato, 2006; Hickerson et al., 2006; Little and Stevenson, 2007) and the emerging evidence for different rates of evolution in different climate zones and taxa (Pinceel et al., 2005; Wright et al., 2006). However, we believe scrutiny is still needed given that problematic metrics are being widely used and implemented in software such as "The Barcode of Life Data System." We hope that the large amount of empirical evidence presented here will convince the users of DNA barcoding to choose smallest over mean interspecific distances for computing barcoding gaps.


    Acknowledgments
We are grateful to two anonymous reviewers and the editors, whose helpful comments improved the clarity of the manuscript. R.M. would like to acknowledge financial support from grants R-154-000-256-112 and R-154-000-270-112 of the Ministry of Education in Singapore.


    References

    Armstrong K. F., Ball S. L. DNA barcodes for biosecurity: Invasive species identification. Philos. Trans. R. Soc. Lond B (2005) 360:1813–1823.

    Ball S. L., Hebert P. D. N., Burian S. K., Webb J. M. Biological identifications of mayflies (Ephemeroptera) using DNA barcodes. J. North Am. Benthol. Soc. (2005) 24:508–524.

    Barrett R. D. H., Hebert P. D. N. Identifying spiders through DNA barcodes. Can. J. Zool. (2005) 83:481–491.

    Check E. Cowrie study strikes a blow for traditional taxonomy. Nature (2005) 438:722–723.

    Clare E. L., Lim B. K., Engstrom M. D., Eger J. L., Hebert P. D. N. DNA barcoding of Neotropical bats: Species identification and discovery within Guyana. Mol. Ecol. Notes (2007) 7:184–190.

    Cognato A. I. Standard percent DNA sequence difference for insects does not predict species boundaries. J. Econ. Entomol. (2006) 99:1037–1045.

    Cognato A. I., Caesar R. M. Will DNA barcoding advance efforts to conserve biodiversity more efficiently than traditional taxonomic methods? Front. Ecol. Environ. (2006) 4:268–273.

    Cywinska A., Hunter F. F., Hebert P. D. N. Identifying Canadian mosquito species through DNA barcodes. Med. Vet. Entomol. (2006) 20:413–424.

    Dasmahapatra K. K., Mallet J. DNA barcodes: Recent successes and future prospects. Heredity (2006) 97:254–255.

    DeSalle R., Egan M. G., Siddall M. The unholy trinity: Taxonomy, species delimitation and DNA barcoding. Philos. Trans. R. Soc. Lond B (2005) 360:1905–1916.

    Ferguson J. W. H. On the use of genetic divergence for identifying species. Biol. J. Linn. Soc. (2002) 75:509–516.

    Hajibabaei M., Janzen D. H., Burns J. M., Hallwachs W., Hebert P. D. N. DNA barcodes distinguish species of tropical Lepidoptera. Proc. Natl. Acad. Sci. USA (2006a) 103:968–971.

    Hajibabaei M., Singer G. A. C., Hickey D. A. Benchmarking DNA barcodes: Does the DNA barcoding gap exist? Genome (2006b) 49:851–854.

    Hebert P. D. N., Cywinska A., Ball S. L., deWaard J. R. Biological identifications through DNA barcodes. Proc. R. Soc. Biol. Sci. B (2003a) 270:313–321.

    Hebert P. D. N., Penton E. H., Burns J. M., Janzen D. H., Hallwachs W. Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proc. Natl. Acad. Sci. USA (2004a) 101:14812–14817.

    Hebert P. D. N., Ratnasingham S., deWaard J. R. Barcoding animal life: Cytochrome c oxidase subunit 1 divergences among closely related species. Proc. R. Soc. Biol. Sci. B (2003b) 270:S96–S99.

    Hebert P. D. N., Stoeckle M. Y., Zemlak T. S., Francis C. M. Identification of birds through DNA barcodes. PLoS Biol. (2004b) 2:1657–1663.

    Hickerson M. J., Meyer C. P., Moritz C. DNA barcoding will often fail to discover new animal species over broad parameter space. Syst. Biol. (2006) 55:729–739.

    Hogg I. D., Hebert P. D. N. Biological identification of springtails (Hexapoda: Collembola) from the Canadian Arctic, using mitochondrial DNA barcodes. Can. J. Zool. (2004) 82:749–754.

    Janzen D. H., Hajibabaei M., Burns J. M., Hallwachs W., Remigio E., Hebert P. D. N. Wedding biodiversity inventory of a large and complex Lepidoptera fauna with DNA barcoding. Philos. Trans. R. Soc. Lond B (2005) 360:1835–1845.

    Lambert D. M., Baker A., Huynen L., Haddrath O., Hebert P. D. N., Millar C. D. Is a large-scale DNA-based inventory of ancient life possible? J. Hered. (2005) 96:279–284.

    Lee M. S. Y. The molecularisation of taxonomy. Invertebr. Syst. (2004) 18:1–6.

    Lefebure T., Douady C. J., Gouy M., Gibert J. Relationship between morphological taxonomy and molecular divergence within Crustacea: Proposal of a molecular threshold to help species delimitation. Mol. Phylogenet. Evol. (2006) 40:435–447.

    Little D. P., Stevenson D. W. A comparison of algorithms for the identification of specimens using DNA barcodes: Examples from gymnosperms. Cladistics (2007) 23:1–21.

    Lorenz J. G., Jackson W. E., Beck J. C., Hanner R. The problems and promise of DNA barcodes for species diagnosis of primate biomaterials. Philos. Trans. R. Soc. Lond B (2005) 360:1869–1877.

    Mallet J., Isaac N. J. B., Mace G. M. Response to Harris and Froufe, and Knapp et al.: Taxonomic inflation. Trends Ecol. Evol. (2005) 20:8–9.

    McClellan D. A., Woolley S. AlignmentHelper, Version 1.0 (2004) Provo, Utah: Brigham Young University.

    Meier R., Kwong S., Vaidya G., Ng P. K. L. DNA barcoding and taxonomy in Diptera: A tale of high intraspecific variability and low identification success. Syst. Biol. (2006) 55:715–728.

    Meyer C. P., Paulay G. DNA barcoding: Error rates based on comprehensive sampling. PLoS Biol. (2005) 3:2229–2238.

    Moritz C., Cicero C. DNA barcoding: Promise and pitfalls. PLoS Biol. (2004) 2:1529–1531.

    Pinceel J., Jordaens K., Backeljau T. Extreme mtDNA divergences in a terrestrial slug (Gastropoda, Pulmonata, Arionidae): Accelerated evolution, allopatric divergence and secondary contact. J. Evol. Biol. (2005) 18:1264–1280.

    Powers T. Nematode molecular diagnostics: From bands to barcodes. Annu. Rev. Phytopathol. (2004) 42:367–383.

    Prendini L. Comment on "Identifying spiders through DNA barcodes." Can. J. Zool. (2005) 83:498–504.

    Ratnasingham S., Hebert P. D. N. BOLD: The Barcode of Life Data System. Mol. Ecol. Notes (2007) 7:355–364. http://www.barcodinglife.org

    Roe A. D., Sperling F. A. H. Patterns of evolution of mitochondrial cytochrome c oxidase I and II DNA and implications for DNA barcoding. Mol. Phylogenet. Evol. (2007) 44:325–345.

    Saunders G. W. Applying DNA barcoding to red macroalgae: A preliminary appraisal holds promise for future applications. Philos. Trans. R. Soc. Lond B (2005) 360:1879–1888.

    Seifert K. A., Samson R. A., deWaard J. R., Houbraken J., Levesque C. A., Moncalvo J.-M., Louis-Seize G., Hebert P. D. N. Prospects for fungus identification using CO1 DNA barcodes, with Penicillium as a test case. Proc. Natl. Acad. Sci. USA (2007) 104:3901–3906.

    Smith M. A., Fisher B. L., Hebert P. D. N. DNA barcoding for effective biodiversity assessment of a hyperdiverse arthropod group: The ants of Madagascar. Philos. Trans. R. Soc. Lond B (2005) 360:1825–1834.

    Smith M. A., Woodley N. E., Janzen D. H., Hallwachs W., Hebert P. D. N. DNA barcodes reveal cryptic host-specificity within the presumed polyphagous members of a genus of parasitoid flies (Diptera: Tachinidae). Proc. Natl. Acad. Sci. USA (2006) 103:3657–3662.

    Sperling F. DNA barcoding. Deus et machina. Newsl. Biol. Surv. Can. (Terrestrial Arthropods) Opin. Page (2003) 22:50–53.

    Thompson J. D., Higgins D. G., Gibson T. J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994) 22:4673–4680.

    Vences M., Thomas M., Bonett R. M., Vieites D. R. Deciphering amphibian diversity through DNA barcoding: Chances and challenges. Philos. Trans. R. Soc. Lond B (2005a) 360:1859–1868.

    Vences M., Thomas M., Van der Meijden A., Chiari Y., Vieites D. R. Comparative performance of the 16S rRNA gene in DNA barcoding of amphibians. Front. Zool. (2005b) 2:5.

    Ward R. D., Zemlak T. S., Innes B. H., Last P. R., Hebert P. D. N. DNA barcoding Australia's fish species. Philos. Trans. R. Soc. Lond B (2005) 360:1847–1857.

    Will K. W., Rubinoff D. Myth of the molecule: DNA barcodes for species cannot replace morphology for identification and classification. Cladistics (2004) 20:47–55.

    Wright S., Keeling J., Gillman L. The road from Santa Rosalia: A faster tempo of evolution in tropical climates. Proc. Natl. Acad. Sci. USA (2006) 103:7718–7722.

    Zehner R., Amendt J., Schuett S., Sauer J., Krettek R., Povolny D. Genetic identification of forensically important flesh flies (Diptera: Sarcophagidae). Int. J. Leg. Med. (2004) 118:245–247.

