Sponsored by the U.S. Department of Energy Human Genome Program
Human Genome News Archive Edition
Human Genome News, September 1994; 6(3):1
As technology improves and information accumulates exponentially, continued progress in the Human Genome Project will depend increasingly on the development of sophisticated computational tools and resources to manage and interpret data. Various systems now manage data relevant to genome research; these systems range from highly specialized databases supporting local research projects to general databases serving the entire international community both as repositories and as analysis resources that guide ongoing research.
Public databases containing the nucleotide sequences of the complete human genome and of selected model organism genomes will be a major product of the Human Genome Project. The ease with which researchers can retrieve and use the data from these and other related databases will provide one measure of the project's success.
Although much progress has been made in database development and operation, many challenges remain in collecting, organizing, storing, and distributing data. As maps and sequences accumulate and the focus shifts from data generation to analysis, new challenges will arise. Some feel that a key task will be to link the various biological databases into a loosely coupled distributed alliance so researchers around the world can explore all relevant facets of a particular topic. Research and development for these interoperable databases demand the close interaction of biologists with mathematicians, software engineers, and programmers to develop the needed software, database tools, operational infrastructure, and algorithms.
Four major nucleotide sequence databases now store almost 200 million bp representing humans and more than 8000 other species. The four are GenBank® and the Genome Sequence Data Base (GSDB) in the United States, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, and the DNA DataBank of Japan (DDBJ). Each group collects a portion of the total sequence data reported worldwide, often processing submissions and update requests within 48 hours. Because they exchange new and updated sequences frequently (usually daily), the databases contain the same sequences, each in its own format. All four of these evolving databases are working to improve data design and quality.
Database growth is accelerating rapidly, with more than half the sequences having been added in the last 2 years. The total is expected to rise dramatically within the next decade to about 10 billion bp. As more genes are identified and sequenced and understanding of sequence data improves, databases will play an increasing role in capturing new knowledge and making it accessible.
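The growth figures above can be checked with a little arithmetic. The sketch below assumes steady exponential growth and reads "within the next decade" as exactly 10 years; both are simplifying assumptions, and the function names are illustrative only.

```python
import math

def doublings_needed(start_bp, target_bp):
    """Number of times the collection must double to grow from start to target."""
    return math.log2(target_bp / start_bp)

def implied_doubling_time(start_bp, target_bp, years):
    """Doubling time (in years) implied by reaching target_bp after `years`,
    assuming constant exponential growth (an assumption, not a source claim)."""
    return years / doublings_needed(start_bp, target_bp)

current_bp = 200e6     # "almost 200 million bp" (from the article)
projected_bp = 10e9    # "about 10 billion bp" (from the article)
horizon_years = 10     # assumed reading of "within the next decade"

print(round(doublings_needed(current_bp, projected_bp), 2))
print(round(implied_doubling_time(current_bp, projected_bp, horizon_years), 2))
```

Under these assumptions, a 50-fold increase corresponds to between five and six doublings, or a doubling time somewhat under 2 years. That is broadly consistent with the observation that more than half the sequences were added in the last 2 years.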
Sequence Database History
The Los Alamos Sequence Library was established in 1979 at the DOE Los Alamos National Laboratory (LANL) to store DNA sequence data in electronic form. At about the same time, database activities were beginning at EMBL, and discussions began in 1982 on collaborations between the two institutions. From these early days, EMBL and LANL agreed that any data submitted to or entered by one group would be forwarded immediately to the other, thus avoiding duplication of effort. As other sequence databases joined the collaboration, data sharing was extended to them.
The LANL data library evolved into the database GenBank when in 1982 Bolt, Beranek, and Newman (BBN) became the primary contractor for distribution of data and user support. LANL became a subcontractor to BBN, providing data collection and design. Sequence data activities at BBN and LANL were funded by the NIH National Institute of General Medical Sciences (NIGMS) with support from DOE and other agencies. At the end of the first 5-year contract, IntelliGenetics became the primary contractor, with LANL again as subcontractor charged with designing and building the database.
With the conclusion of the second 5-year contract in October 1992, NIGMS transferred its responsibility for the GenBank project to the National Center for Biotechnology Information (NCBI) at the National Library of Medicine. NCBI, directed by David Lipman, had been created in November 1988 to develop automated information systems for supporting biotechnology and molecular biology and to conduct basic research in computational molecular biology. LANL continued to handle all direct GenBank submissions and updates through a DOE-NIH interagency agreement.
In August 1993 the data resources at LANL and NCBI became independent of each other, with both providing collection and distribution services and continuing to enhance access and usability of the databases. The GenBank database remained at NCBI, and the LANL service led by Michael Cinkosky took the new name of Genome Sequence Data Base (GSDB). GSDB recently moved to the National Center for Genome Resources (NCGR) in Santa Fe, New Mexico, which focuses on the development of resource projects to support public and private genome research.
The EMBL Nucleotide Sequence Database was established in 1982 as the EMBL Data Library in Heidelberg, Germany. Directed by Graham Cameron, the database is now maintained and distributed by the European Bioinformatics Institute (EBI), a new EMBL outstation at Hinxton Hall near Cambridge, U.K. The Sanger Centre and the Medical Research Council's Human Genome Mapping Program Resource Centre are also located at the site. In addition to the Nucleotide Sequence Database, EBI maintains and distributes the SWISS-PROT Protein Sequence Database in collaboration with Amos Bairoch (University of Geneva) as well as more than 30 other specialty databases.
DDBJ was created in 1984 and began operating independently in 1986 with the sponsorship of the Japanese Ministry of Education, Science, and Culture and representative Japanese molecular biologists. Directed by Yoshio Tateno, DDBJ is accumulating nucleotide sequence data, mostly submitted by Japanese researchers. In addition to the sequence database, DDBJ makes 15 other databases available through its Gopher system.
Data Sources and Submission
During the first few years of database operation, all data were collected by scanning published articles for DNA or RNA sequence data, which were then typed into a computer and distributed in both electronic and printed form. The increase in data, however, soon began to overwhelm data processors, causing a delay between publication and appearance in the database. Also, some researchers became concerned that much sequence data would never be published because journals began limiting the amount they would print, and authors left out the sequences they considered less important.
[Figure: Growth in the world's collection of nucleotide sequence data is shown as the number of bases contained in every release of GenBank from 1 through 82. The numbers at the tops of the dotted lines show years (which do not necessarily coincide with a particular number of releases). The shaded bar in the middle represents the period in the mid-1980s when the data volume was, for a time, more than the databases could handle.]
To alleviate these delays and problems, workers at EMBL, LANL, and IntelliGenetics developed an electronic data-publishing approach and encouraged authors to deposit sequences directly into databases before submitting their results for journal publication. Most journal editors now require such prior submissions, although an author may request that the data not be released until the article appears in print.
Nearly all data are now acquired through direct submissions to one of the four databases, where they are received, processed, and shared with the others. Groups generating large volumes of data can arrange a procedure with the databases to simplify submissions. A small amount of data still enters the databases via journal scanning, done at NCBI, and this is also shared immediately. Under a contract with the European Patent Office, EMBL draws on sequence data reported in the patent literature since 1960. NCBI captures corresponding data from U.S. patents, and DDBJ Release 18 (July 1994) contains 4551 entries processed by the Japan patent office.
The most commonly used direct-submission mechanism is the Authorin automatic-processing program. Developed by IntelliGenetics, Authorin guides users through the process of entering sequence data and providing biological and bibliographical annotation. Authorin software can be obtained on diskettes from NCBI [see article entitled, "Database Distribution and Access Details", HGN 6(3):4] or by ftp from the databases and many online repositories of biological software. Early next year, NCBI will offer an alternative to Authorin with a new point-and-click-style program called Sequin, which will have interface improvements for both stand-alone and network users. Network users will have live access to GenBank and MEDLINE for more complete annotation of their sequences.
GSDB uses the Annotator's WorkBench (AWB) software, which allows offsite users with Internet access to have full and immediate control over data submission, annotation, and release, thus offering an advance over batch-mode submission. Offsite users who are unable to run the graphical version of AWB are issued accounts on one of GSDB's machines, from which they can run AWB 2.x. Also, centers with in-house Sybase expertise can write special applications that perform updates directly on the master GSDB database using client-server access.
Important sources of direct submissions to NCBI include numerous expressed sequence tags (ESTs), which are partially sequenced cDNAs that are stored in the dbEST database. To facilitate access to and simplify comparisons of sequence-tagged sites (STSs) with sequences in other divisions, NCBI recently created a separate database (dbSTS) that provides detailed information about STS map locations and polymerase chain reaction conditions.
EMBL has established submission accounts for groups producing large volumes of nucleotide sequence data over an extended period, a procedure that has proven flexible and efficient both for database staff and a number of genome research groups. DDBJ recently developed a test version of a relational database system on Sybase for large data submitters such as human and rice EST projects.
Database Use and Access
At present, users of sequence databases typically want to retrieve records based either on sequence similarity, which can offer clues to gene sequence functions, or on keywords. For sequence similarity searching, computer programs are used to compare a query sequence with a subset of the database and find statistically meaningful alignments. To retrieve by other criteria such as keywords, gene name, or gene product function, users can search text descriptions. Databases are increasingly important for facilitating gene searches and comparing annotated information to detect sequence relationships that have not been determined experimentally.
Each sequence database record corresponds to a continuous piece of DNA, the largest of which is about 685 kb from the human T-cell receptor. Typical database entries contain, in flat-file format, a concise sequence description; the taxonomic description of the source organism; bibliographic information; a table of features listing the locations of biologically significant elements such as protein-coding regions, transcription units, mutations, or modifications; and protein translations of coding regions. Each entry is curated by database staff, who check for biological consistency (e.g., coding sequences should not contain "stop" codons). When appropriate, entries may also be cross-referenced to other databases; for example, EMBL has established cross references for SWISS-PROT, Eukaryotic Promoter Database, Transcription Factor Database, and FlyBase.
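The stop-codon consistency check mentioned above can be illustrated with a short sketch. It is not the databases' actual curation software. It assumes the standard genetic code (stop codons TAA, TAG, and TGA), assumes the coding sequence starts in frame, and allows the final codon to be a stop. The function name is hypothetical.

```python
# Stop codons of the standard genetic code (an assumption; some organisms
# and organelles use variant codes).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def internal_stop_positions(cds):
    """Return 0-based offsets of in-frame stop codons that occur before the
    final codon of a coding sequence. An empty list means the check passes."""
    seq = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # The last complete codon may legitimately be a stop, so skip it.
    return [i * 3 for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]

print(internal_stop_positions("ATGAAATAA"))     # terminal stop only
print(internal_stop_positions("ATGTAGAAATAA"))  # premature stop at offset 3
```

A curator's tool would presumably flag any entry for which the returned list is not empty and refer it back to the submitter.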
A variety of methods are used to distribute and access these databases, including magnetic tapes, CD-ROMs, e-mail, and the Internet. Data are now accessible through information services such as WWW, Gopher, and WAIS (Wide Area Information Server).
EMBL Data Library
The EMBL sequence database is available via network services and European Molecular Biology Network nodes (EMBnet; 19 sites). EMBL, SWISS-PROT, and a number of other databases distributed by EBI are accessible via EBI network servers and included in quarterly CD-ROMs. For querying the sequence databases, EMBL-Search for Macintosh or CD-SEQ for DOS is supplied with the CD-ROMs. Sequence databases are provided in a format for use with software such as FastA on Macintosh or MS-DOS systems, and EMBL Scan is supplied for rapid searching for very similar sequences.
EBI Network Fileserver enables access via e-mail to the full collection of databases, public-domain software, and documentation maintained by EBI (see article entitled, "Database Distribution and Access Details"). For molecular biology databanks, the EBI WWW server will soon offer the SRS network browser. SRS will allow interactive querying of the EMBL Nucleotide Sequence, SWISS-PROT, PIR, and NRL3D databases, with hyperlinks to cross-referenced entries in several specialized molecular biology databases distributed by EBI.
GSDB emphasizes online, networked data access, offering a WWW server for individual users and a fully functional relational database for other developers. Entries in its WWW server are hyperlinked to an array of external sources, including Genome Data Base, SWISS-PROT, and the Enzyme Commission catalog maintained on the WWW server at Johns Hopkins University (JHU). An online relational server containing the GSDB database is available at NCGR and many satellite sites around the world. Anyone with a Sybase front-end license may access a read-only copy of GSDB at NCGR using either generic database-access tools or special-purpose programs. GSDB may also be searched using the GenQuest system [see HGN 6(3):6].
NCBI has introduced several new services to facilitate public use of GenBank, including the retrieve and BLAST e-mail servers and WWW access to sequence and bibliographic data. Since 1992, NCBI has offered access to gene and protein sequences and related MEDLINE bibliographic citations via a graphical user interface. A combination of the three integrated databases and retrieval software Entrez is available on CD-ROM, as an Internet client-server application, and via WWW [see article entitled, "Database Distribution and Access Details", HGN 6(3):4]. The power of Entrez is in the precomputed links among the constituent databases; these links allow users to retrieve a DNA sequence by searching for text terms, author name, or accession number and to look up associated protein and MEDLINE citations by clicking a button. BLAST sequence similarities have also been precomputed for every DNA and protein sequence.
GenQuest Sequence Analysis Service
GenQuest is a sequence analysis service that acts as "middleware" connecting the user with many databases and integrating networked resources into one easy-to-use system. With a Mosaic interface created through a collaboration between Oak Ridge National Laboratory (ORNL) and JHU, the GenQuest graphical client sends a properly formatted request to the ORNL online server. The server uses FastA, BLAST, or full Smith-Waterman to analyze several databases (e.g., GSDB, SWISS-PROT, and PDB), and search results are returned quickly as a standard Mosaic page with hot links established to all referenced data objects in the report.
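For readers unfamiliar with the Smith-Waterman method named above, the following is a minimal sketch of its local-alignment scoring step with a linear gap penalty. It is not the GenQuest implementation. The scoring values (match +3, mismatch -3, gap -2) and the function name are illustrative assumptions, and production searches typically use substitution matrices and affine gap penalties instead.

```python
def smith_waterman_score(a, b, match=3, mismatch=-3, gap=-2):
    """Return the best local-alignment score between sequences a and b,
    using the Smith-Waterman recurrence with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # Flooring at zero lets an alignment restart anywhere, which is
            # what makes the method local rather than global.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

A search service would compute such a score for the query against each database sequence and report the highest-scoring entries. Heuristic programs such as FastA and BLAST approximate this ranking much faster than the full dynamic-programming calculation.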
This integrated resource is possible because the online ORNL GenQuest server can return analyses quickly enough to support a real-time interface; also, all pertinent databases provide data with standard network-accessible protocols. The GenQuest service is an example of how distributed information resources can be combined easily and economically into important tools for the community.
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v6n3).
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome.
Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.