DOE Expands LANL Sequence Data Management

The DOE Office of Health and Environmental Research has announced that management of sequence data at Los Alamos National Laboratory (LANL) is now operating independently with an expanded mission and a new name: Genome Sequence DataBase (GSDB). GSDB will function both as a research resource for the specific needs of the Human Genome Project and as a service facility.

For more than 10 years, LANL maintained GenBank, an electronic database that serves as the national repository for all nucleotide sequence information, through subcontracts and interagency agreements with the National Institute of General Medical Sciences (NIGMS) and the National Center for Biotechnology Information (NCBI), both units of NIH.

Now, operating as GSDB, LANL researchers will continue to accept new direct data submissions and provide update and annotation services for sequences in their care. They will also extend their work in developing new computer tools to improve the value of genetic sequence databases to the international research community. The name GenBank will now be used exclusively by NCBI to describe the nucleotide sequence database services that NCBI will continue to provide to the scientific community.

GSDB Research and Development Activities

  • Emphasize database interoperability and remote data access by improving methods for establishing links between sequence and other databases, such as those for genetic maps.
  • Produce better data-submission and annotation tools to aid producers of bulk data as well as individual scientists.
  • Create data models that better represent biological relationships and facilitate linkage with other databases.
  • Expand the concept of electronic data publishing pioneered at LANL.

GSDB Service Facility Data-Management Activities

  • Continued electronic processing of direct submissions (normally within 48 hours) at the following e-mail addresses: for submissions and update@t10.lanl.gov for corrections and additions.
  • Collaboration with other databases to provide a unified international data-collection entity.
  • Increased emphasis on automating the quality-control process, in addition to quality control of submissions by GSDB annotation and review staff.
  • Increased emphasis on online submission and maintenance. For the growing portion of the research community on the Internet, GSDB will stress the use of online data submission tools such as the Annotator's WorkBench (AWB) over batch-submission tools like Authorin.
  • Renewed emphasis on remote database access through continued support for relational satellite copies of the LANL database as well as direct Sybase client-server access for remote queries in Structured Query Language (SQL).
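
A remote query against a relational satellite copy, as mentioned in the last item above, might look roughly like the following. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for a Sybase server, and the table name, column names, and accession numbers are invented for illustration, since the GSDB schema is not described here.

```python
import sqlite3


def build_satellite_copy():
    """Create an in-memory stand-in for a relational satellite copy.

    The schema and rows are hypothetical; SQLite replaces Sybase here
    so that the sketch runs without a database server.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequence_entry ("
        " accession TEXT PRIMARY KEY,"
        " organism TEXT NOT NULL,"
        " length_bp INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO sequence_entry VALUES (?, ?, ?)",
        [
            ("DEMO0001", "Homo sapiens", 1200),
            ("DEMO0002", "Homo sapiens", 350),
            ("DEMO0003", "Mus musculus", 2100),
        ],
    )
    conn.commit()
    return conn


def find_entries(conn, organism, min_length_bp):
    """Return (accession, length_bp) pairs for one organism, using plain SQL."""
    cursor = conn.execute(
        "SELECT accession, length_bp FROM sequence_entry"
        " WHERE organism = ? AND length_bp >= ?"
        " ORDER BY accession",
        (organism, min_length_bp),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    satellite = build_satellite_copy()
    print(find_entries(satellite, "Homo sapiens", 500))
```

Because the query is ordinary SQL, the same statement could in principle be sent to a local copy or to a remote server; only the connection step would differ.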

GSDB Relationship to Other Databases

GSDB is committed to productive and complementary interactions with other sequence databases such as those at the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and NCBI. DOE has been engaged in discussions regarding future relationships between GSDB and other databases to ensure that GSDB activities complement those at other sites to produce improved services for the user community. Data submitted to GSDB are being made available to other sequence databases as soon as processing is complete. Also, data submitted to other databases are incorporated into GSDB as the data become available.

Interoperable Information Resources

For more than a year, DOE has been carrying out an extensive review of its informatics activities in support of genome and structural biology research. Advice and comments from reviewers and the community have emphasized the need for improved, integrated information resources [see HGN 5(3), 1-4 (September 1993)]. The review report stated that interoperability among crucial databases is essential and noted that current databases are unable to answer simple queries requiring integration of map, sequence, and other biological data.
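
A simple query that needs both map and sequence data can be sketched as follows. The example is purely illustrative: the two record types, the shared locus identifier used to link them, and all sample values are assumptions made for this sketch and do not describe any existing database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapRecord:
    """One entry in a hypothetical genetic map resource."""
    locus: str
    chromosome: str
    position_cm: float


@dataclass(frozen=True)
class SequenceRecord:
    """One entry in a hypothetical sequence resource."""
    accession: str
    locus: str
    length_bp: int


# Two separate resources that share only the locus identifier (invented data).
MAP_DB = [
    MapRecord("LOCUS_A", "4", 12.5),
    MapRecord("LOCUS_B", "4", 48.0),
    MapRecord("LOCUS_C", "7", 3.2),
]
SEQUENCE_DB = [
    SequenceRecord("DEMO0001", "LOCUS_A", 1200),
    SequenceRecord("DEMO0002", "LOCUS_C", 800),
    SequenceRecord("DEMO0003", "LOCUS_A", 450),
]


def sequences_on_chromosome(chromosome, map_db=MAP_DB, sequence_db=SEQUENCE_DB):
    """List (accession, locus, map position) for sequences mapped to a chromosome.

    Answering this requires both resources: the map supplies the position,
    and the sequence resource supplies the accession numbers.
    """
    positions = {m.locus: m.position_cm for m in map_db if m.chromosome == chromosome}
    hits = [
        (s.accession, s.locus, positions[s.locus])
        for s in sequence_db
        if s.locus in positions
    ]
    # Order by map position, then accession, for a stable result.
    return sorted(hits, key=lambda hit: (hit[2], hit[0]))


if __name__ == "__main__":
    print(sequences_on_chromosome("4"))
```

The link works only because both resources use the same locus identifier, which is the kind of shared convention that interoperability depends on.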

Following advice from the report and elsewhere, DOE determined that it must develop and support an integrated information infrastructure for genome and structural biology research. DOE also resolved that major database elements in the integrated infrastructure should emphasize direct access through networked application programming interfaces and allow direct online data submission, annotation, and curation by the research community.

The database component should be both a research project and a production service supporting ongoing biological research. The research project would develop better data models and direct online tools for data submission and curation and for federated data access.

In the short term, nucleotide data-resource development supported by DOE will take advantage of the specific expertise, facilities, and capacity developed at LANL during its long tenure as a leading U.S. site for nucleotide database development. Over time, DOE nucleotide data resources will undoubtedly evolve in accordance with the developing integrated infrastructure (a 'center without walls') and will be subjected to extensive peer review and competitive evaluation.

Historical Role of Los Alamos

In the 1970s, Walter Goad established the Los Alamos Sequence Database, a pioneering effort at LANL that in 1982 evolved into the GenBank project. LANL continued to expand and build the database in collaboration with the firm Bolt, Beranek and Newman under funding provided by NIGMS and other federal agencies. From 1987, LANL remained the site of database design and maintenance, working with IntelliGenetics.

In 1992, NIH transferred its management control for the GenBank project from NIGMS to NCBI at the National Library of Medicine. At that time, DOE and NCBI entered into an Inter-Agency Agreement (IAA) so that LANL could provide assistance in processing direct submissions for NCBI. The IAA noted, 'For nine years, LANL has been responsible for the design and management of gene sequence data as part of the GenBank project. . . . In the most recent re-competition, all three proposals which were in the competitive range included LANL as a subcontract for the direct data submission component of the project. Thus, LANL was recognized not only for its past experience in establishing the procedures for collecting and managing biological data, but for its innovative approaches in handling data prior to or independent of the publication process.'

Now, NCBI has developed its own capacity for processing direct submissions, freeing LANL to develop new approaches, tools, and services targeted specifically for the genome community.

