The Barcode of Life Database (BOLD) Systems
As with whole genome sequencing programs for individual species, our large-scale surveys of single genes from thousands of species will generate a torrent of data. For example, our barcoding studies on animals over the next few years alone will entail the acquisition of more than 300Mb of COI sequences, or about 10% of the size of the entire human genome. Unlike complete genome databases, our sequence records must link to a named voucher specimen, as well as to collection information, high quality images, and other collateral data. Our database must be searchable by sequence as well as by species name and higher taxonomic categories, and must have the capacity to quickly compare thousands of sequences in order to provide positive identifications. As a result, the creation of an effective Barcode of Life Database (BOLD) represents a crucial component of our overall vision for a complete barcoding system for eukaryotes.
The databasing system that we have begun to develop (www.barcodinglife.org) has already received considerable acclaim from international collaborators, and its functionality will be displayed in the press conference following the barcoding symposium in London in February 2005. Below, we highlight the key components currently available in BOLD, and outline some impending innovations and improvements. We emphasize that BOLD is far more than an online repository for COI sequences. It is, in fact, a comprehensive workbench for barcode analysis that includes three major functional elements: a laboratory information management system, a data management and analysis system, and a species identification engine. Its capabilities are critical to the advance of our work and to the broader activation of DNA barcoding.
Data Management and Analysis System (BOLD-DMAS)
The DMAS component of BOLD currently provides support for both the warehousing and analysis of the sequence records generated by facilities using our LIMS. It provides a simple interface enabling the members of each research team to upload new sequences to password-protected projects. In addition, the DMAS enables project administrators to assign varying levels of access to data for different researchers involved in a project. For example, a project leader will be able to access, edit, or delete data entries, while a graduate student may simply have the ability to add new data. This hierarchical management system allows work to proceed simultaneously in different labs and on different groups of organisms while being managed from a centralized location. This greatly improves efficiency and communication, and prevents data from being lost or duplicated. Moreover, because the DMAS includes information on where each specimen was collected and where the voucher specimen has been deposited, copies of sequence traces, and high resolution photographs of each specimen, it allows straightforward traceability of the data stream back to the source.
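The tiered access described above can be pictured with a minimal sketch. The role names, the `Permission` flags, and the `can` helper are hypothetical illustrations rather than part of BOLD, and granting the project leader the ability to add records is an assumption that the text does not state.

```python
from enum import Flag, auto


class Permission(Flag):
    """Actions a project member may perform on data entries (hypothetical names)."""
    NONE = 0
    ADD = auto()
    VIEW = auto()
    EDIT = auto()
    DELETE = auto()


# Hypothetical role table. The leader's ADD right is an assumption;
# the text names only access, edit, and delete for a project leader.
ROLES = {
    "project_leader": Permission.ADD | Permission.VIEW | Permission.EDIT | Permission.DELETE,
    "graduate_student": Permission.ADD,
}


def can(role: str, action: Permission) -> bool:
    """Return True if the role's permission set includes the requested action."""
    granted = ROLES.get(role, Permission.NONE)  # unknown roles get no rights
    return (granted & action) == action
```

Under these assumptions, `can("graduate_student", Permission.ADD)` is true while `can("graduate_student", Permission.EDIT)` is false, which mirrors the project leader and graduate student example in the paragraph above.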
At present, BOLD-DMAS is used by a small but growing number of users, both in Canada and internationally. However, it was designed to operate at a much larger scale – on both a national level (i.e., within our Network), and eventually world-wide. We emphasize that all sequences gathered by the global bird and fish projects will be deposited in BOLD. As well, we expect the future development of separate BOLD-derived databases that support particular sectors. For example, CSIRO Marine Research has indicated their interest in taking responsibility for developing a fish barcode database. Similar applied databases can be envisioned for use in forestry, agriculture, and other programs that require more specialized information than that included in the general barcode database.
Aside from serving as a data repository, the DMAS component of BOLD includes a suite of analysis tools that allow rapid and simple processing of data. Sequence records, which can be submitted via a simple interface, are automatically aligned. Specimen pages are created automatically from the user-defined data, including an automated plot of GIS coordinates on high-resolution, multi-scale geographic maps made available by NASA. BOLD also includes standard programs for assembling and exporting neighbour-joining trees, with colour coding to indicate taxonomic affiliation or other user-defined parameters.
While this system has proved highly effective for COI sequences from animals, it is not currently compatible with datasets involving multiple genes or non-coding gene regions. Since barcode systems for some taxonomic groups (most notably, plants) are likely to use genes other than COI, we will develop new versions of the DMAS that can accommodate records from other genes.
Species Identification System (BOLD-ID)
The first step in generating a functional DNA-based species identification system lies in the assembly of a comprehensive barcode gene sequence library, which is carried out using the LIMS and DMAS components of BOLD. The second step involves the creation of an effective ID system for the comparison and matching of sequences from new specimens to the existing sequence library. Because the primary application of BOLD for most public users will be the identification of specimens using newly-acquired barcode sequences, the database has been designed to make this process as user-friendly and efficient as possible. For example, BOLD includes a simple user interface to allow COI sequences to be cut and pasted into a search field and automatically compared against the existing dataset. Also, rather than employing the standard Basic Local Alignment Search Tool (BLAST) protocol to query the sequence database, BOLD makes use of Markov models based on a global protein alignment for the COI gene, which increases both the speed and accuracy of the matching procedure. Using this algorithm, BOLD returns a probability-based match profile indicating the likely identity of the source species. Links to that species’ page provide additional information about it (e.g., photographs) that can be useful in confirming the identification.
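The general idea of matching a query sequence against a reference library can be sketched as follows. This is not the Markov-model procedure described above: it substitutes a much simpler nearest-match search using the proportion of differing sites (p-distance) between sequences that are assumed to be already aligned. The function names, the toy library, and the 12-base sequences are all hypothetical.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences.

    Sites where either sequence has a gap ('-') or an ambiguous base ('N')
    are skipped rather than counted as differences.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def identify(query: str, library: dict) -> list:
    """Return (species, distance) pairs, closest reference first."""
    hits = [(species, p_distance(query, seq)) for species, seq in library.items()]
    return sorted(hits, key=lambda hit: hit[1])


# Toy reference library (hypothetical species names and sequences).
library = {
    "Species A": "ATGGCATTCCTA",
    "Species B": "ATGGCGTTTCTA",
    "Species C": "ATAGCGTTTTTG",
}
query = "ATGGCATTCCTG"
best_species, best_distance = identify(query, library)[0]
```

Here the query differs from the Species A reference at one of twelve sites, so that reference ranks first. A real identification engine would also need to report how much confidence the match deserves, which is the role of the probability-based match profile described above.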
The Future of BOLD
i. Derivative protocols and technologies: In order to further our long-term goal of producing a hand-held, automated species identification system, we will begin development of a prototype system that identifies North American Lepidoptera (moths and butterflies). This effort will involve the development of new DNA analysis methods and computer software, but will make use of existing hardware. Specifically, we will develop a custom “lep disc” that contains the necessary primers, Taq polymerase, and other reagents required for DNA release and amplification. This will consolidate the early stages of DNA analysis, allowing a field researcher to simply touch a small tissue sample to the disc and subsequently amplify COI in a handheld PCR device. Because sequencing cannot yet be done in situ, our early demonstrations will require subsequent sequencing of COI in the lab. However, once the sequence has been acquired, the procedure can return to the field. We will use a hand-held computing device equipped with a GPS module to automatically log collection location data, and we will also develop a specialized version of BOLD, complete with species images, sequence records, and existing distributional data, that can be stored on a flash card and run on this device. We believe that a working prototype of this sort will provide an impetus to the private sector for the production of an integrated system that will enable the move from specimen to ID entirely in the field. With the necessary support, we believe that our prototype system for North American lepidopterans will be operational in 2 years.
ii. Linkages with other databases: BOLD is currently being developed with cross-links to the existing GenBank database, which will greatly expand its accessibility to the genomic community. Based on this model of inter-linkage, we believe it will be possible and highly beneficial for BOLD to serve as a conduit to other non-genic databases of biological information. For example, performing a barcode-based identification with BOLD could automatically provide links to the information available in databases such as FishBase, AmphibiaWeb, the Animal Genome Size Database (to which we will contribute a large amount of new data derived from our barcoding specimens), the Tree of Life Project, and many others. Automating the creation of such links as part of the entry of new barcode data represents an important challenge to be met in the future development of BOLD.
iii. Simulation studies: We will carry out simulation studies to examine issues important in data interpretation. For example, past studies have indicated that it will be difficult to recover a full-length COI product from many aged museum specimens. Because of this, it is important to ascertain the effect of sequence truncation on the reliability of identifications, and we will carry out simulation studies to quantify these effects.
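One possible shape for such a truncation simulation is sketched below. Every parameter is an assumption chosen for illustration (a 650-base full-length sequence, 50 simulated species, 5% divergence of each species from a common ancestor, 1% within-species variation), and the scoring uses a plain mismatch count rather than BOLD's own matching procedure.

```python
import random


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each base with a different base with probability `rate`."""
    bases = "ACGT"
    return "".join(
        rng.choice([b for b in bases if b != c]) if rng.random() < rate else c
        for c in seq
    )


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def identification_rate(length, n_species=50, full_length=650,
                        between=0.05, within=0.01, trials=200, seed=1):
    """Fraction of simulated queries assigned to the correct species when
    only the first `length` bases of each sequence are compared."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    ancestor = "".join(rng.choice("ACGT") for _ in range(full_length))
    refs = [mutate(ancestor, between, rng) for _ in range(n_species)]
    correct = 0
    for _ in range(trials):
        true_idx = rng.randrange(n_species)
        # Simulate a new specimen of the chosen species, then truncate it.
        query = mutate(refs[true_idx], within, rng)[:length]
        scores = [mismatches(query, ref[:length]) for ref in refs]
        best = min(scores)
        # Count a success only if the true species is the unique best match.
        if scores[true_idx] == best and scores.count(best) == 1:
            correct += 1
    return correct / trials
```

Calling `identification_rate` over a range of lengths (for example 650, 300, 100, and 50) gives a curve of identification success against fragment length. Real studies would replace the simulated references with observed COI sequences, since actual divergence is not uniform across sites or lineages.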
Development of BOLD will be led by Brian Golding (McMaster), who brings a long history of involvement in bioinformatics. Paul Hebert (Guelph), who has familiarity with both sequence diversity in the COI gene and the needs of the taxonomic community, will help set the design parameters for the animal segment of BOLD, and a similar role will be played by the leaders of our fungal, plant and protist working groups. In addition to our current programming team (led by Sujeevan Ratnasingham), we will hire additional skilled programmers and data managers to carry out further expansion of the BOLD system, including the development of specialized applied, locally distributed, and hand-held versions.