
Human Genome Project Information Archive
1990–2003

Archive Site Provided for Historical Purposes


Sponsored by the U.S. Department of Energy Human Genome Program

Human Genome News Archive Edition

Human Genome News, January 1992; 3(5)

LLNL, GDB Transfer Map Data

The success of the Human Genome Project requires efficient, automated transfer of physical mapping data from laboratories to large public databases such as Genome Data Base (GDB) at Johns Hopkins University. An effort to develop such a transfer system is described below.

GDB and the physical mapping project at Lawrence Livermore National Laboratory (LLNL) have been cooperating to develop a prototype approach to automated data transfer. Physical mapping data from LLNL are now uploaded to GDB automatically and become part of the public release from GDB. The system will evolve through continued evaluation.

Certain generic constraints were recognized in planning the approach. Concern for the security and integrity of both the sending and receiving databases required that (1) the receiving database not be allowed unlimited access to all data in the sending database and (2) the sending database could not inject data into the receiving database without the knowledge, consent, and active participation of the recipient system's staff.

Other considerations involved minimizing adverse effects on both the sending and receiving databases: neither should have to reconfigure its internal data representations to meet the needs of the other. Efficient transfer dictated that the sending and receiving databases not be required to prepare or parse each data shipment, either manually or with ad hoc programs.

Both GDB and LLNL store data using commercial client-server-type relational database management systems (RDBMSs) running on workstation-class computers with Internet access. Both sites happen to use a system from Sybase, which is a convenience rather than a necessity. The database designs at the two sites did not allow a direct data transfer, but addressing the generic aspects of data transfer made possible an automated system that met the constraints described above.

Efforts to automate the transfer of data involved the following steps:

  1. GDB and LLNL staff determined what data at LLNL would be appropriate for transfer to GDB. Data models at each site were compared to determine how LLNL data structures might map to those at GDB.
  2. An intermediate relational schema was designed with tables that could hold contig-type physical mapping data in a format consistent with GDB needs and with LLNL designs. Data in these tables describe the contigs, the cloned elements in the contigs (cosmids thus far), and all biologically interesting attributes that have been determined for these cloned elements.
  3. These tables were installed at LLNL as an adjunct database, the "GDB Interface" (GDBI). GDBI and the LLNL main mapping database are managed by the server component of the DBMS.
  4. At timely intervals, LLNL updates the tables in its GDBI database with data from its own mapping database. Although this transfer can be executed automatically as a database transaction, at present LLNL reviews all transferred data for accuracy and consistency with release policy.
  5. GDB has been given unlimited read-only permission for GDBI, to be used in any fashion and at any time. For automated transfer, however, a Sybase "Client" software package at GDB queries the LLNL system to capture GDBI data and transfer them directly via Internet into database tables managed by the GDB Sybase server.
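
The steps above can be sketched in a few lines of code. This is only an illustration of the pattern, not the LLNL or GDB implementation: it uses Python's built-in sqlite3 module in place of Sybase, and every table, column, and function name (lab_clone, gdbi_cosmid, refresh_interface, pull_interface, and so on) is hypothetical. In the real system the read-only restriction is enforced by database permissions; here it is only a convention that the receiver issues SELECT statements against the interface tables and nothing else.

```python
import sqlite3


def make_supplier_db():
    """Toy supplier database: internal working tables plus interface tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Internal working tables (hypothetical layout, never read by the receiver)
        CREATE TABLE lab_clone  (clone_id TEXT PRIMARY KEY, island TEXT, released INTEGER);
        CREATE TABLE lab_result (clone_id TEXT, kind TEXT, value TEXT);
        -- Interface tables in the agreed intermediate schema (steps 2 and 3)
        CREATE TABLE gdbi_contig    (contig_name TEXT PRIMARY KEY);
        CREATE TABLE gdbi_cosmid    (cosmid_name TEXT PRIMARY KEY, contig_name TEXT);
        CREATE TABLE gdbi_attribute (cosmid_name TEXT, attr_type TEXT, attr_value TEXT);
    """)
    db.executemany("INSERT INTO lab_clone VALUES (?, ?, ?)",
                   [("cosA", "ctg1", 1), ("cosB", "ctg1", 1), ("cosC", "ctg2", 0)])
    db.executemany("INSERT INTO lab_result VALUES (?, ?, ?)",
                   [("cosA", "cDNA_hybridization", "positive"),
                    ("cosB", "repeat_content", "Alu"),
                    ("cosC", "repeat_content", "L1")])
    db.commit()
    return db


def refresh_interface(db):
    """Step 4: the supplier rebuilds the interface tables in one transaction."""
    with db:  # commits on success, rolls back if any statement fails
        db.execute("DELETE FROM gdbi_attribute")
        db.execute("DELETE FROM gdbi_cosmid")
        db.execute("DELETE FROM gdbi_contig")
        db.execute("INSERT INTO gdbi_contig "
                   "SELECT DISTINCT island FROM lab_clone WHERE released = 1")
        db.execute("INSERT INTO gdbi_cosmid "
                   "SELECT clone_id, island FROM lab_clone WHERE released = 1")
        db.execute("INSERT INTO gdbi_attribute "
                   "SELECT r.clone_id, r.kind, r.value FROM lab_result r "
                   "JOIN lab_clone c ON c.clone_id = r.clone_id WHERE c.released = 1")


def make_receiver_db():
    """Toy receiver database with its own, different table layout."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE map_element (name TEXT, contig TEXT, source TEXT)")
    db.execute("CREATE TABLE element_fact (name TEXT, fact_type TEXT, fact_value TEXT)")
    return db


def pull_interface(supplier, receiver, source="LLNL"):
    """Step 5: the receiver reads only gdbi_* tables and loads its own tables."""
    cosmids = supplier.execute(
        "SELECT cosmid_name, contig_name FROM gdbi_cosmid").fetchall()
    facts = supplier.execute(
        "SELECT cosmid_name, attr_type, attr_value FROM gdbi_attribute").fetchall()
    with receiver:
        receiver.executemany("INSERT INTO map_element VALUES (?, ?, ?)",
                             [(name, contig, source) for name, contig in cosmids])
        receiver.executemany("INSERT INTO element_fact VALUES (?, ?, ?)", facts)
    return len(cosmids)
```

Calling refresh_interface on the supplier connection and then pull_interface with both connections moves only the released records across. The unreleased clone stays in the supplier's internal tables, which the receiver never queries.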

The main characteristics of this interface design are the following:

  • GDB has access only to GDBI and not to other LLNL databases or computer systems (i.e., no login privileges). Thus, LLNL computers and main database are insulated from uncontrolled outside access.
  • LLNL does not actively enter data into GDB. After being notified of updates via e-mail, GDB uses procedures under its control to obtain LLNL data.
  • Neither the sending nor the receiving database was required to modify its internal working data representations. However, both were obliged to cooperate in developing the appropriate intermediate interface database.
  • Both data addition by LLNL and data extraction by GDB can be accomplished automatically by invoking the appropriate routines. No manual preparation or parsing of data is required.
  • Both LLNL and GDB can modify internal data representations without "breaking" the transfer system, provided that mapping between the interface database and the new internal representations is consistent. Thus, a stable interface allows the continued evolution of both sending and receiving databases without endangering data flow.
  • Direct, transnetwork, machine-to-machine transfer is achieved, and no tapes need be prepared and shipped. With this system, LLNL contig-mapping and attribute data appear in the form GDB desires, not as LLNL sees and stores the same data. Thus GDB is insulated from LLNL conceptions of data.
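
The point about a stable interface can be made concrete with a small sketch. The code below is an assumption-laden illustration, not part of the actual system: InterfaceCosmid, export_v1, export_v2, and receiver_load are hypothetical names, and the two "internal" record layouts are invented for the example. It shows a supplier changing its internal representation while the receiver's loading code stays untouched, because only the mapping onto the interface row changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceCosmid:
    """Row shape agreed between supplier and receiver (hypothetical)."""
    cosmid_name: str
    contig_name: str


def export_v1(internal_rows):
    """Supplier's original internal form: (clone_id, island) tuples."""
    return [InterfaceCosmid(clone_id, island) for clone_id, island in internal_rows]


def export_v2(internal_rows):
    """Supplier's revised internal form: dicts with different field names."""
    return [InterfaceCosmid(row["id"], row["group"]) for row in internal_rows]


def receiver_load(interface_rows):
    """Receiver code depends only on InterfaceCosmid, never on supplier internals."""
    return {row.cosmid_name: row.contig_name for row in interface_rows}
```

Passing the output of either exporter to receiver_load yields the same result for the same underlying data, so the supplier can switch from the first layout to the second without any change on the receiving side.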

The generic nature of GDBI may allow the interface tables used at LLNL to be used by other supplier laboratories to hold data for transfer to GDB. The use of equivalent interface databases would spare other laboratories from devising systems de novo and would ensure that biologically similar data are entered into GDB in a logically similar manner. The figure shows the arrangement of the system's main components.

GDBI consists of some 11,750 different data items arranged in 15 tables (presently aggregating to 3 megabytes). These data comprise some 26 contigs, 320 cosmids, and a total of 815 attribute facts about these cosmids.

The GDBI database will contain information on all contigs for which some useful biological fact has been ascertained. Biological facts about the contigs include finding positive hybridization to member cosmids of a unique sequence element (e.g., cDNA probes), in situ hybridization mapping results, and repetitive element content. Contigs are described according to their cosmid membership and the order of a minimal covering subset of elements.
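
One way to picture a contig described by "cosmid membership and the order of a minimal covering subset" is sketched below. The article does not say how LLNL derives the covering subset, so the greedy interval-cover routine here is a stand-in technique chosen for illustration. The function name minimal_cover and the idea of giving each cosmid a (start, end) position are assumptions, not part of GDBI.

```python
def minimal_cover(cosmids):
    """Return names of a smallest subset of cosmids spanning the contig, left to right.

    cosmids maps a cosmid name to a hypothetical (start, end) position.
    Uses the classic greedy interval-cover strategy: from the point covered
    so far, always pick the overlapping cosmid that reaches furthest right.
    """
    items = sorted(cosmids.items(), key=lambda kv: kv[1][0])
    right_end = max(end for _, end in cosmids.values())
    covered = items[0][1][0]  # start of the leftmost cosmid
    chosen = []
    i = 0
    while covered < right_end:
        best = None
        # Scan every not-yet-seen cosmid that starts at or before the covered point.
        while i < len(items) and items[i][1][0] <= covered:
            if best is None or items[i][1][1] > best[1][1]:
                best = items[i]
            i += 1
        if best is None or best[1][1] <= covered:
            raise ValueError("cosmids do not overlap into a single contig")
        chosen.append(best[0])
        covered = best[1][1]
    return chosen


def describe_contig(name, cosmids):
    """Bundle membership and ordered covering subset into one record."""
    return {
        "contig": name,
        "members": sorted(cosmids),
        "ordered_cover": minimal_cover(cosmids),
    }
```

For five overlapping cosmids a (0, 40), b (10, 35), c (30, 75), d (35, 70), and e (60, 100), the covering subset is a, c, e in that order, while all five remain listed as members.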

Whereas this approach nominally requires that the supplier laboratory use a relational client-server DBMS and provide direct Internet access, the server need not use the same vendor as GDB. For example, the Sybase "Open-Server" software interface allows remote RDBMS servers supplied by other vendors to be accessed directly through standard relational database query language (SQL). With significantly more effort, the Open-Server package can be used to access supplier DBMSs that are neither client-server nor relational in design, as long as they afford Internet connection and an appropriate application program interface.

At tolerable cost, the Internet connection requirement can be relaxed and telephone lines can be made to appear to the computers at both ends as reasonably high speed "Internet" links. A suitable interface that is programmable to a supplier database and a willingness to cooperate are the only real requirements for putting such systems in place. To provide reasonable functionality, the supplier database should be maintained on a workstation-class, multitasking computer; however, this need not be excessively expensive.


Investigators interested in experimenting with this system at their sites should contact

  • Elbert Branscomb
    (510/422-5681, Fax: 510/422-2282, Internet: elbert@alu.llnl.gov)

Elbert Branscomb and Tom Slezak (LLNL) and Robert Robbins (GDB) contributed to this article. Many more people, especially those at GDB, were critically involved in the project itself. Richard Lucier (now at the University of California, San Francisco) and Peter Pearson (GDB) played essential roles in responding favorably and aggressively to original suggestions that this data transfer experiment be undertaken; many GDB technical staff members worked very hard to make the project successful.


HGMIS Staff


The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v3n5).

Human Genome Project 1990–2003

The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome. See Timeline for more HGP history.

Human Genome News

Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.