Bioinformatics and Advanced DNA Technology
Preamble
Over its five years of operation, the Canadian Barcode of Life Network will pursue a number of important research goals, but its primary mission is the development of a DNA barcoding system to aid the identification of the most ‘important’ animal species in Canada. The success of this mission will provide a model that can be implemented for other domains of eukaryotic life. If DNA barcoding is going to gain broad acceptance as an identification tool, its application must be simple, robust and inexpensive. We intend to meet these goals through the efforts of our Bioinformatics and Advanced DNA Technology groups.
The activities of our Advanced DNA Technology Group (ADTG) will, in part, be directed towards ensuring that the Canadian Barcode of Life Network researchers, especially those who are not currently involved in DNA analysis, have access to a core analytical facility that delivers high quality results promptly and cost-effectively. This facility will also offer basic support in the analysis and interpretation of DNA data, providing a service analogous to that offered by Statistics Departments on many university campuses. Finally, the ADTG will take the lead in resolving methodological problems and in adopting new technologies that either speed sequence analysis or lower its cost.
Our Bioinformatics group will develop the web-based infrastructure needed to ensure that DNA barcodes serve as the basis for an effective identification system. To this end, it will assemble the data generated through the Canadian Barcode of Life Network research, create the analytical infrastructure needed to speed taxon diagnosis and develop a public web portal enabling access to this information.
As a result of the Canadian Barcode of Life Network’s activity and similar initiatives in other nations, there will be a flood of sequence information on the COI gene. This information promises to provide important new insights into the modulators of molecular evolution. To ensure that adequate attention is paid to the broader implications of these results, we have established a final group that will be charged with exploring these issues.
Advanced DNA Technology
Our Advanced DNA Technology Group will play a critical role in methodological innovation. In the first instance, it will develop optimal protocols for handling and archiving a range of biological samples and for the PCR amplification of target sequences from them. In the following sections we discuss the approaches that we will employ to resolve key issues ranging from specimen validation to accelerated protocols for sequence characterization.
Voucher Specimens: The specimens employed in our analyses will, in essence, be the “DNA type” material for their species. As a result, it is critical that these specimens be properly identified and that they be deposited in a museum setting that will ensure their long-term security. Because our pilot projects have shown that it is possible to recover enough DNA for COI analysis from a single leg of even the smallest insects, we will ordinarily be able to deposit specimens adequate for taxonomic characterization. To ensure the short-term security of specimens that are submitted for processing, the ADTG will make use of archival storage capabilities available through the Biodiversity Institute of Ontario. As well, we have negotiated for the long-term curation of the specimens generated through our work by the Canadian Museum of Nature and the Canadian National Collection. In other cases, our work will be based on specimens held in existing collections, such as those at the Royal Ontario Museum and those of the Canadian Forest Service. In all cases, care will be taken to connect the accession numbers for these specimens with our newly collected DNA data. We will also assemble collateral information on each specimen, including the date and location of its collection and, where appropriate, digital images of it. This information will be used to create a ‘specimen’ page for each organism that we analyze (see examples at www.barcodinglife.uoguelph.ca using ‘barcode’ as a user ID and password).
Specimen Identification: Most of our research teams include personnel with sufficient taxonomic expertise to enable at least tentative species identifications. In some cases, they also include individuals who are the leading taxonomic authority in the group under study. When this is not the case, we will reach out for taxonomic assistance. Within Canada, we have received offers of support from leading taxonomists based within the Canadian Museum of Nature and Agriculture & Agri-Food Canada. Where Canada lacks top expertise, we will seek it internationally. We have received offers of support in this regard from individuals based at organizations such as the National Museum of Natural History (Smithsonian Institution) and the American Museum of Natural History.
DNA Extraction: Museum collections represent a primary source for well-identified specimens. However, because these specimens were neither preserved nor curated with DNA analysis in mind, there are significant challenges in retrieving high molecular weight fragments of DNA from them. Our pilot work has established protocols that are 100% successful in recovering DNA from dried specimens up to 15 years old. However, it has not been possible to generate full-length PCR products of COI-5’ from older specimens with current primers. This problem arises from the degradation of DNA into shorter fragments, and we have begun to resolve it by generating a series of internal PCR primers that target shorter segments of the region. As this work has been successful in recovering COI-5’ sequences from specimens up to 30 years of age, we are confident that the ADTG will be able to lead the development of further protocol improvements. Published reports have, for example, demonstrated the recovery of mitochondrial DNA sequences from century-old insect specimens and from formalin-preserved material. Our work will focus both on the improvement of existing protocols and on tests of their efficacy for different groups of organisms.
PCR Amplification: “Universal” primers are currently available that amplify the COI-5’ region from a wide variety of animals (Folmer et al. 1994). However, these primers do not work reliably on certain groups. Using sequence information in GenBank, we have begun the task of designing new primers to enable the recovery of COI-5’ from all animal groups. Our preliminary successes in this regard indicate that we will be able to accomplish this goal. In addition, we will need to design new primers to enable the recovery of COI from other major groups of organisms such as plants, fungi, and protists. In cases where sequence variation in the COI gene does not provide sufficient resolution to distinguish species, the ADTG will explore the use of other genes for DNA barcoding, and develop primers and protocols suitable for their use in the appropriate groups.
DNA Sequencing: The ADTG will operate a core DNA sequencing facility based in the Biodiversity Institute of Ontario (BIO), which will occupy a new building at the University of Guelph (construction to be complete by January 2005). Two Beckman CEQ8000 capillary sequencers are currently available to the ADTG, each capable of processing 96 samples per day. Thus, we have the capacity to analyze up to 24,000 sequences per instrument per year (96 samples x 5 days x 50 weeks), or 48,000 with both instruments. We also have a CFI/OIT award with which we can purchase additional sequencers that will enable us to double our capacity if this is required. We emphasize that the ADTG sequencing facility will be available to all members of the Canadian Barcode of Life Network, but that its use will be discretionary. However, we do expect that, because of the high sample volume, this facility will be able to offer low-cost access to high quality data.
Towards Automation: At present, all DNA extractions and subsequent PCR reactions are carried out manually, so a single technician can process just 50 specimens a day. We plan to employ a robotic system to increase production by an order of magnitude. We emphasize that once the design parameters for this robotic system are established, the $400K required for its purchase is available through an existing CFI/OIT award.
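The throughput figures in the two preceding paragraphs can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text (96 samples per sequencer per day, 50 specimens per technician per day, 5 days x 50 weeks). The function names are illustrative, and applying the same working calendar to manual preparation is an assumption rather than something stated above.

```python
# Figures quoted in the text above.
SAMPLES_PER_INSTRUMENT_PER_DAY = 96
SPECIMENS_PER_TECHNICIAN_PER_DAY = 50

# Working calendar taken from the capacity estimate (5 days x 50 weeks).
# Applying the same calendar to manual preparation is an assumption.
WORKING_DAYS_PER_YEAR = 5 * 50


def annual_sequencing_capacity(instruments: int) -> int:
    """Sequences per year for a given number of capillary sequencers."""
    return SAMPLES_PER_INSTRUMENT_PER_DAY * WORKING_DAYS_PER_YEAR * instruments


def annual_preparation_capacity(technicians: int, speedup: int = 1) -> int:
    """Specimens extracted and amplified per year.

    `speedup` is a hypothetical multiplier; 10 corresponds to the
    order-of-magnitude gain hoped for from a robotic system.
    """
    return (SPECIMENS_PER_TECHNICIAN_PER_DAY * WORKING_DAYS_PER_YEAR
            * technicians * speedup)


if __name__ == "__main__":
    print("sequencing, 1 instrument: ", annual_sequencing_capacity(1))
    print("sequencing, 2 instruments:", annual_sequencing_capacity(2))
    print("manual prep, 1 technician:", annual_preparation_capacity(1))
    print("robotic prep (10x):       ", annual_preparation_capacity(1, speedup=10))
```

Under these assumptions a single technician prepares 12,500 specimens a year, roughly half of what one sequencer can read, which suggests that sample preparation rather than sequencing is the step most in need of automation.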
Microarray Analysis: Although most eukaryotes are large enough to be handled as individuals, this is often not the case for smaller organisms such as fungi and protists. These organisms regularly occur in soil and water as mixed-species assemblages that prevent the separate analysis of individuals. Recent studies have established that microarray-based diagnostic approaches are valuable in these situations. Their feasibility has already been demonstrated in fungi (Levesque et al. 1998), nematodes (Uehara et al. 1999) and bacteria (Fessehaie et al. 2003). These microarrays include a collection of short oligonucleotides (16-28 bases in length) that match a particular gene. The location of these oligonucleotides within the gene can be chosen so that they are specific to different taxonomic levels such as species, genus or family. DNA is extracted from an “unknown” sample, and the region of the gene corresponding to the oligonucleotides on the microarray is PCR amplified, fluorescently labeled and then hybridized to the microarray. The labeled DNA binds only to oligonucleotides that match it in sequence. Matches are identified by quantifying the level of fluorescence on each oligonucleotide.
Oligonucleotides for microarrays and for PCR amplification will be designed from a COI database generated through this project using Signature Oligo, software developed to identify oligonucleotides of various specificity levels from large sequence databases (LifeIntel Software Inc., Port Moody, BC). Signature Oligo generates a spreadsheet containing sequences that are appropriate for the design of specific oligonucleotides (see http://www.lifeintel.com/news-2002-03-29.html). The software can process a large number of individual sequences, or sequences can be grouped into folders. It can identify oligonucleotides that are conserved across all sequences within a folder but do not match any sequence outside that folder. This approach can be used to design species-specific oligonucleotides when there is intraspecific variation or to design oligonucleotides that are diagnostic for higher taxonomic groups (genus, family, etc.). The spreadsheet generated by Signature Oligo is then used as the input for Array Designer (PREMIER Biosoft International, Palo Alto, CA), which adjusts the length of the oligonucleotides to optimize their performance on microarrays or as PCR primers.
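The selection rule described above (conserved across every sequence in a folder, absent from every sequence outside it) can be illustrated with a simple set computation over fixed-length subsequences. This is a minimal sketch of the idea only; it is not the algorithm used by Signature Oligo, whose internals are not described here, and the function names and the exact-match criterion are assumptions made for illustration.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All subsequences of length k that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_oligos(group: list[str], others: list[str], k: int = 20) -> set[str]:
    """Candidate oligonucleotides diagnostic for `group`.

    A candidate must occur in every sequence of `group` (conserved
    despite intraspecific variation) and in no sequence of `others`.
    Only exact matches are considered; a real design tool would also
    have to weigh near matches and hybridization properties.
    """
    if not group:
        return set()
    # Keep only the k-mers shared by every member of the target group.
    conserved = set.intersection(*(kmers(s, k) for s in group))
    # Discard any k-mer that also occurs outside the group.
    for s in others:
        conserved -= kmers(s, k)
    return conserved


if __name__ == "__main__":
    # Toy sequences, far shorter than real COI-5' barcodes.
    target = ["ACGTACGTTT", "GGACGTACGA"]
    outside = ["TTTTACGTAA"]
    print(sorted(signature_oligos(target, outside, k=6)))
```

Passing all sequences of a genus or family as `group` gives candidates for higher taxonomic levels in the same way; `k` would normally be set within the 16-28 base range quoted above.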
Personnel: This group will be led by Teresa Crease (Guelph), who has five years of experience in the oversight of sequencing facilities gained through her role as Director of the Core Sequencing Facility at the University of Guelph. The Canadian Barcode of Life Network expects to provide funding for one postdoctoral fellow (PDF) and two technicians to support the ADTG. The PDF will supervise the facility and will lead the development of new protocols, while the technicians will focus their efforts on sequence analysis. Our work on microarrays will be led by André Lévesque, who has had a long involvement in this area of research. The Canadian Barcode of Life Network will provide 2 years of support for a PDF to advance this work.
The Barcode of Life Database
Because of our intensive involvement in the development of DNA-based identification systems, the Canadian Barcode of Life Network will be well positioned to lead the development of a web-based system that delivers biological identifications. We believe that there is merit in developing this Barcode of Life Database (BoLD) as a stand-alone enterprise, at least initially. In the longer term, as interest in DNA-based identification systems grows, we expect to surrender administration of this enterprise to any organization charged with the oversight of biological identification systems, either nationally or internationally.
We emphasize that all sequences gathered by the Canadian Barcode of Life Network researchers will also be submitted to GenBank in a timely fashion and that access to the BoLD website will be freely available. We note, as well, that our plans to develop a substantive database will be greatly aided by infrastructure provided by recent CFI/OIT awards. In particular, we point to the impending acquisition of a $2M Storage Array Network by the Biodiversity Institute of Ontario at Guelph that will be available to store all of our genomic data and relevant collateral information. We note, as well, the availability of intensive processing power through SHARCNET, a $15M distributed computing network with major nodes at both Guelph and McMaster.
Our decision to create a new sequence database was not taken lightly. However, we are confident that our work will contribute an important and novel functional element to the repertoire of genomic ‘services’ provided by existing organizations. The most important of these entities is the National Center for Biotechnology Information (NCBI), based in Bethesda, Maryland. This organization plays a critical role in supporting global genomics research through its involvement in the warehousing and organization of sequence information. Although its accomplishments have been remarkable, the broad scope of its responsibilities has slowed the ability of NCBI to develop specialized resources. However, the organization is committed to extending the usage of its sequence information for specialized applications. For example, its taxonomy group not only provides detailed taxonomic placements for newly submitted sequences, but is also working to augment GenBank files in other ways. We believe that there are important ways in which the Canadian Barcode of Life Network will be able to aid efforts at NCBI to develop an effective DNA-based identification system. We point first to the small amount of collateral information that accompanies most GenBank records. GenBank did recently add the capability to link to external image files of organisms and to detailed location data, but these fields are rarely completed. By contrast, these fields will be required for all entries submitted by Canadian Barcode of Life Network researchers to the Barcode of Life Database (BoLD). We will also require, whenever possible, that sequence submissions be linked to a voucher specimen that has been deposited in a major museum. This is not required for GenBank submissions, which means that it is ordinarily impossible to check problematic entries.
Because it provides such broad genomic coverage, NCBI has also been unable to fully develop “value added” resources to aid usage of the sequence information that it holds. For example, GenBank currently has some 8000 COI records for members of the animal kingdom. However, comparisons of these COI sequences are complicated because they derive from different regions of the gene and even their 5’ to 3’ orientation is not standardized. As a result, any effort to focus analysis on a single gene region is a slow process that requires the repeated extraction and alignment of sequences. The Canadian Barcode of Life Network will greatly simplify this task through its assembly of an aligned database for all COI-5’ sequences. It will further extend this resource by extracting all relevant sequence information from GenBank. This aligned database will, first of all, be valuable as an identification engine. However, it will also be useful in varied other contexts, both practical and theoretical. For example, such alignments are invaluable in primer design. They also aid broad comparisons of factors, such as rates of evolution, that are significant in theoretical contexts.
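One of the obstacles noted above, the lack of a standard 5’ to 3’ orientation, can be handled before alignment by comparing each record and its reverse complement against a reference of known orientation. The following is a minimal sketch under stated assumptions: the shared k-mer count is a deliberately crude similarity measure chosen for brevity, the function names are hypothetical, and only unambiguous bases are complemented.

```python
# Translation table for complementing unambiguous bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> set[str]:
    """All subsequences of length k that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a: str, b: str, k: int) -> int:
    """Number of distinct k-mers that two sequences have in common."""
    return len(kmers(a.upper(), k) & kmers(b.upper(), k))


def orient_to_reference(seq: str, reference: str, k: int = 8) -> str:
    """Return seq in whichever orientation resembles the reference more.

    Ties are resolved in favour of the orientation as submitted.
    """
    flipped = reverse_complement(seq)
    if shared_kmers(flipped, reference, k) > shared_kmers(seq, reference, k):
        return flipped
    return seq


if __name__ == "__main__":
    reference = "ATGGCATTCCGTACGGATTACA"    # toy reference, not a real barcode
    record = reverse_complement(reference[3:18])
    print(orient_to_reference(record, reference))
```

Records oriented in this way could then be trimmed to the COI-5’ region and added to the alignment; sequences that resemble the reference in neither orientation would need separate handling.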
As we move to create BoLD, we will focus attention on the following five issues relating to its structure and functionality:
Database Fields: The information collected will determine what can and cannot be done in the future. The way in which information is parsed into tables will determine the kind of generalized queries that can be made of the data. We have created a beta version of BoLD and over the summer of 2003, we will invite its critical review by both the Canadian Barcode of Life Network researchers and by colleagues with interests in DNA-based identification systems. We will use this feedback to aid development of a revised version by fall 2003. This version will be subjected to a final review, with the goal of hardening data entry fields by early 2004.
Data Entry: We plan to allow public submissions of COI-5’ data for inclusion in BoLD. This decision means that data entry must be completely transparent to anyone approaching the website. Without training, new users must be capable of following the instructions to enable data submission. Although this sounds trivial, it is actually quite difficult to accomplish and requires careful planning. Newly submitted data will flow to a server that will be separated from our main database by a substantive firewall. All data will be subjected to a validation process. Our site technician will firstly ensure that the sequence(s) do not show any gross anomalies. For example, when a sequence shows unexpected structural change (e.g. insertion, deletion), a query will be sent to the submitter. As well, when a sequence shows taxonomic misplacement, such as a purported arthropod sequence that shows close similarity to those of vertebrates, a query will be dispatched. Such vigilance is critical for BoLD to provide the highest quality identification service possible. We emphasize that all database submissions generated by our Network will require the deposition of a museum accession number and the name of the taxonomic authority who validated the identification. However, we will also include sequence data that lack full taxonomic voucher support (e.g. most GenBank sequences) although this fact will be noted in any identification generated with such data.
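The two screening checks described above can be sketched as a function that returns the queries to send back to a submitter. This is an illustration only: the expected length and the phylum of the nearest database match are passed in as parameters because the text does not fix them, the names `Submission` and `validation_queries` are hypothetical, and using a length comparison as a proxy for insertions and deletions assumes that submissions span the full amplified region.

```python
from dataclasses import dataclass

# Characters accepted without comment: the four bases plus N for unknown.
ACCEPTED_CHARACTERS = set("ACGTN")


@dataclass
class Submission:
    sequence: str         # submitted COI-5' nucleotide sequence
    claimed_phylum: str   # phylum stated by the submitter


def validation_queries(sub: Submission, expected_length: int,
                       nearest_match_phylum: str) -> list[str]:
    """Return the queries that should be sent to the submitter.

    An empty list means the submission passed these basic checks.
    """
    queries = []
    seq = sub.sequence.upper()

    unexpected = sorted(set(seq) - ACCEPTED_CHARACTERS)
    if unexpected:
        queries.append("unexpected characters: " + "".join(unexpected))

    # COI is protein coding, so a length difference that is not a
    # multiple of three suggests a frameshifting insertion or deletion
    # (or a sequencing error).
    difference = len(seq) - expected_length
    if difference % 3 != 0:
        queries.append(f"length differs by {difference} bases: possible frameshift")
    elif difference != 0:
        queries.append(f"length differs by {difference} bases: possible in-frame indel")

    # Taxonomic misplacement, e.g. a purported arthropod sequence whose
    # closest database match is a vertebrate.
    if nearest_match_phylum != sub.claimed_phylum:
        queries.append(
            f"claimed {sub.claimed_phylum} but nearest match is {nearest_match_phylum}"
        )
    return queries


if __name__ == "__main__":
    suspect = Submission(sequence="ACGTACG", claimed_phylum="Arthropoda")
    for query in validation_queries(suspect, expected_length=6,
                                    nearest_match_phylum="Chordata"):
        print(query)
```

Returning a list of queries rather than a pass or fail flag matches the workflow described above, in which anomalies prompt a question to the submitter instead of automatic rejection.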
Data Analysis: BoLD will be designed to provide rapid and informative output. Although it will not be difficult to satisfy this criterion initially, speed will become an increasing problem as the database grows to millions of sequences and records. In this respect, a proper relational database is critical. We are using PostgreSQL, an open source relational database, for BoLD’s construction. We emphasize, as well, that programming will be required to make the web interface to this database convenient, flexible and responsive. We are using PHP and Perl CGI scripting to create our database interfaces.
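To make the point about relational structure concrete, the sketch below separates specimen records from sequence records and links them by a key, so that collateral data are stored once per specimen. It is an illustration under stated assumptions, not the BoLD schema: the table and column names are invented from the fields mentioned in this section (accession number, taxonomic authority, collection date, GPS co-ordinates, image), and Python's built-in `sqlite3` module stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema; column names are illustrative only.
SCHEMA = """
CREATE TABLE specimen (
    specimen_id      INTEGER PRIMARY KEY,
    museum_accession TEXT,           -- NULL for records lacking a voucher
    taxon_name       TEXT NOT NULL,
    identified_by    TEXT,           -- taxonomic authority, if known
    collection_date  TEXT,
    latitude         REAL,
    longitude        REAL,
    image_url        TEXT
);
CREATE TABLE barcode_sequence (
    sequence_id       INTEGER PRIMARY KEY,
    specimen_id       INTEGER NOT NULL REFERENCES specimen(specimen_id),
    marker            TEXT NOT NULL,  -- e.g. 'COI-5P'
    nucleotides       TEXT NOT NULL,
    genbank_accession TEXT
);
CREATE INDEX idx_specimen_taxon ON specimen(taxon_name);
"""


def create_database() -> sqlite3.Connection:
    """Create an in-memory database with the two illustrative tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def sequences_for_taxon(conn: sqlite3.Connection, taxon_name: str) -> list[tuple]:
    """Return (museum_accession, nucleotides) pairs for one taxon."""
    cursor = conn.execute(
        """
        SELECT s.museum_accession, b.nucleotides
        FROM specimen AS s
        JOIN barcode_sequence AS b ON b.specimen_id = s.specimen_id
        WHERE s.taxon_name = ?
        ORDER BY b.sequence_id
        """,
        (taxon_name,),
    )
    return cursor.fetchall()
```

Because each sequence row points to a specimen row, a query by taxon, locality or collection date can retrieve sequences through a join, and the index on `taxon_name` is the kind of measure that keeps such queries responsive as the number of records grows.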
Algorithm Development: Although methods are now available to generate species identifications from sequence data, existing analytical approaches do have some deficits. BLAST is the most commonly used method to rapidly place a new sequence onto an existing (pre-calculated) tree. While BLAST is simple and very rapid, it has limitations in accuracy that can cause it to fail to identify the nearest match. We have implemented a quick fix to this problem that involves a secondary evaluation of the top ten matches to ensure selection of the best match. However, there are several potential methods for reconstructing trees that would provide a superior solution to this problem. Each of these methods is capable of rapid sequence additions, but each has different properties, accuracies and advantages that need to be evaluated. Aside from identifying the algorithm that best assigns a taxonomic identity, we will direct significant effort toward the development of programs that display collateral information. For example, since our data entry program will require the submission of GPS co-ordinates, we will be able to integrate GIS information with our genomic data. This will, for example, make it possible to overlay the locality information for a newly collected specimen on existing distributional data for the species in question. We also plan to develop interfaces delivering high resolution image files of each species included in the database.
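The secondary evaluation of the top ten matches described above follows a common two-stage pattern: a fast, approximate screen produces a shortlist, and a slower, exact comparison picks the best candidate from it. The sketch below illustrates that pattern only. A shared k-mer count stands in for BLAST as the fast screen, the exact measure is the uncorrected proportion of differing sites, and the reference library is assumed to be pre-aligned to a common length; all of these choices, and the function names, are assumptions made for illustration.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All subsequences of length k that occur in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites with a gap or an N in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-N" and y not in "-N"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def identify(query: str, library: dict[str, str], k: int = 8,
             shortlist_size: int = 10) -> tuple[str, float]:
    """Return (best matching name, its p-distance to the query)."""
    query_kmers = kmers(query, k)
    # Stage 1: rank the whole library with a cheap similarity score.
    ranked = sorted(
        library,
        key=lambda name: len(query_kmers & kmers(library[name], k)),
        reverse=True,
    )
    shortlist = ranked[:shortlist_size]
    # Stage 2: re-evaluate only the shortlist with the exact measure.
    best = min(shortlist, key=lambda name: p_distance(query, library[name]))
    return best, p_distance(query, library[best])


if __name__ == "__main__":
    # Toy aligned library; names and sequences are invented.
    library = {
        "sp_a": "ACGTACGTACGTACGT",
        "sp_b": "ACGTACGAACGTACGT",
        "sp_c": "TTTTGGGGCCCCAAAA",
    }
    print(identify("ACGTACGAACGTACGA", library, k=4))
```

The design choice is that the expensive comparison is run against at most `shortlist_size` candidates, so its cost stays constant as the library grows, while the cheap screen only needs to be good enough to keep the true nearest match in the shortlist.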
Simulation Studies: We plan to carry out some relatively simple simulation studies to examine issues important in data interpretation. For example, past studies have indicated that it will be difficult to recover a full-length COI-5’ product from many aged museum specimens. Because of this, it is important to ascertain the effect of sequence truncation on the reliability of identifications, and we will carry out simulation studies to quantify these effects.
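A simulation of this kind can be outlined in a few functions: take a reference library, generate query sequences that differ slightly from a known species, truncate them to several lengths, and record how often the nearest reference is still the correct one. The sketch below is a hedged illustration, not the planned study design. The library is synthetic (random sequences derived from a common ancestor), truncation keeps the leading sites only, and all names, rates and defaults are assumptions; a real study would use observed barcode alignments and realistic divergence between sister species.

```python
import random

BASES = "ACGT"


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites over the shared leading region."""
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("cannot compare empty sequences")
    return sum(x != y for x, y in zip(a[:n], b[:n])) / n


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each site with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def synthetic_library(n_species: int, length: int, divergence: float,
                      rng: random.Random) -> dict[str, str]:
    """Invented reference library: each species is a mutated ancestor."""
    ancestor = "".join(rng.choice(BASES) for _ in range(length))
    return {f"species_{i}": mutate(ancestor, divergence, rng)
            for i in range(n_species)}


def nearest_species(query: str, library: dict[str, str]) -> str:
    """Name of the reference with the smallest p-distance to the query."""
    return min(sorted(library), key=lambda name: p_distance(query, library[name]))


def truncation_accuracy(library: dict[str, str], lengths: list[int],
                        within_species_rate: float = 0.01,
                        replicates: int = 200, seed: int = 0) -> dict[int, float]:
    """Fraction of correct identifications at each truncated length."""
    rng = random.Random(seed)
    names = sorted(library)
    accuracy = {}
    for length in lengths:
        correct = 0
        for _ in range(replicates):
            truth = rng.choice(names)
            # Simulate intraspecific variation, then keep the leading sites.
            query = mutate(library[truth], within_species_rate, rng)[:length]
            if nearest_species(query, library) == truth:
                correct += 1
        accuracy[length] = correct / replicates
    return accuracy


if __name__ == "__main__":
    rng = random.Random(1)
    library = synthetic_library(n_species=20, length=600, divergence=0.03, rng=rng)
    for length, score in truncation_accuracy(library, [50, 100, 300, 600]).items():
        print(length, score)
```

Lowering `divergence` towards `within_species_rate` mimics closely related species, which is the situation in which shortened sequences would be expected to lose resolving power first.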
Personnel: Development of BoLD will be led by Brian Golding (McMaster) who brings a long history of involvement in bioinformatics. Paul Hebert (Guelph) will aid in its development because of his familiarity with both sequence diversity in the COI gene and the needs of the taxonomic community. The Canadian Barcode of Life Network expects to support the development of BoLD with 1 PDF and 1 technician. The PDF will have primary responsibility for leading the simulation studies and for oversight of the database, while the technician will be responsible for screening submissions and for updating the database.
Barcodes and Biological Insights
This research group will direct its efforts towards the extraction of insights concerning important biological processes that can be gained from the large comparative database that will be assembled through our barcoding efforts.
Shifts in Nucleotide Composition: The COI-5’ region provides a strong indication of shifts in nucleotide composition across the mitochondrial genome. We plan studies that will probe the causes of shifts in the G+C content of the mitochondrial genome and their linkage to shifts in nucleotide composition of the nuclear genome (Wang & Hickey 2002, Singer & Hickey 2003).
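The basic quantity behind such comparisons is the G+C content of each sequence, which is often also reported separately for the three codon positions of a protein-coding gene. A minimal sketch follows; the function names are illustrative, ambiguous bases are ignored, and the per-position calculation assumes that the sequence starts at the first position of a codon, which would need to be checked against the alignment in practice.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases that are G or C."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(base in "GC" for base in counted) / len(counted)


def gc_by_codon_position(seq: str) -> tuple[float, float, float]:
    """G+C content at first, second and third codon positions.

    Assumes the sequence begins in frame (at a first codon position).
    """
    first, second, third = (gc_content(seq[offset::3]) for offset in range(3))
    return first, second, third


if __name__ == "__main__":
    toy = "GAAGATGCT"   # three invented codons: GAA GAT GCT
    print(gc_content(toy))
    print(gc_by_codon_position(toy))
```

Computed across the whole barcode library, these values could be tabulated by taxon to look for lineages whose composition has shifted.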
Shifts in Rates of Evolution: The COI-5’ region provides a good indication of genome-wide shifts in rates of mitochondrial evolution. Analyses of early results have, for example, revealed accelerated rates of molecular evolution in parasites and haplo-diploids. We plan broader studies of the causes and implications of this diversity.
Shifts in Protein Structure: Our data on COI sequence divergences will soon provide the most comprehensive database on sequence diversity in any gene. This information will provide an extremely interesting opportunity to investigate the mechanistic significance of shifts in the amino acid composition of COI-5’. For example, snails capable of tolerating anaerobic conditions possess a substantial insertion in one region of COI-5’. We note that comparative protein structure work on COI is likely to be particularly illuminating because of the detailed structural knowledge of this protein.
Personnel: This research program will involve collaboration among four members of the Canadian Barcode of Life Network team, but broader participation is expected. Donal Hickey (Ottawa) will lead work on nucleotide composition because of his active prior involvement in this area. Jim Ballantyne (Guelph) has had a long involvement in work on the kinetic properties of mitochondrial proteins and the structural aspects of mitochondrial membranes. Because of this background, he will lead our work on protein structure. We will be aided in this work through collaborations with Isidore Rigoutsos, Manager of the Bioinformatics and Pattern Discovery Group in IBM’s Computational Biology Center. Finally, Brian Golding (McMaster) and Paul Hebert (Guelph) will advance our work on factors influencing rates of evolution. Brian brings the analytical skills needed to examine this issue, while Paul brings a good understanding of organismic attributes. We expect to deploy one graduate student to aid in the work on nucleotide composition.