Sponsored by the U.S. Department of Energy Human Genome Program
Human Genome News Archive Edition
Human Genome News, January 1992; 3(5)
The success of the Human Genome Project requires efficient, automated transfer of physical mapping data from laboratories to large public databases such as Genome Data Base (GDB) at Johns Hopkins University. An effort to develop such a transfer system is described below.
GDB and the physical mapping project at Lawrence Livermore National Laboratory (LLNL) have been cooperating to develop a prototype approach to automated data transfer. Physical mapping data from LLNL are now uploaded to GDB automatically and become part of the public release from GDB. The system will evolve through continued evaluation.
Certain generic constraints were recognized in planning the approach. Concern for the security and integrity of both the sending and receiving databases required that (1) the receiving database not be allowed unlimited access to all data in the sending database and (2) the sending database could not inject data into the receiving database without the knowledge, consent, and active participation of the recipient system's staff.
Other considerations involved minimizing adverse effects on both the sending and receiving databases: neither should have to reconfigure its internal data representations to meet the needs of the other. Efficient transfer dictated that the sending and receiving databases not be required to prepare or parse each data shipment, either manually or with ad hoc programs.
Both GDB and LLNL store data using commercial client-server relational database management systems (RDBMSs) running on workstation-class computers with Internet access. Both data sets are maintained using a system from Sybase, but this is a convenience, not a necessity. Database design at the two sites did not allow a direct data transfer, but addressing the generic aspects of data transfer made possible an automated system that met the constraints described above.
Efforts to automate the transfer of data involved the following steps:
The main characteristics of this interface design are the following:
The generic nature of GDBI may allow the interface tables used at LLNL to be used by other supplier laboratories to hold data for transfer to GDB. The use of equivalent interface databases would spare other laboratories from devising systems de novo and would ensure that biologically similar data are entered into GDB in a logically similar manner. The figure shows the arrangement of the system's main components.
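The role of interface tables can be illustrated with a minimal sketch. It is not the LLNL or GDB implementation: the table names (internal_cosmid, iface_cosmid, public_cosmid), the function names, and the use of SQLite in place of a client-server RDBMS are all assumptions made for illustration. The sketch shows a supplier that copies only releasable rows into an interface table, and a receiver that starts the transfer itself and reads nothing but that table, which is one way to respect the two security constraints described earlier.

```python
import sqlite3


def build_supplier_db():
    """Supplier side: a private internal table plus a separate interface table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE internal_cosmid (   -- hypothetical internal representation
            cosmid_id TEXT PRIMARY KEY,
            contig_id TEXT,
            lab_notes TEXT               -- private; never copied to the interface
        );
        CREATE TABLE iface_cosmid (      -- the only table the receiver queries
            cosmid_id TEXT PRIMARY KEY,
            contig_id TEXT NOT NULL
        );
    """)
    return db


def publish(supplier):
    """Supplier refreshes the interface table from its own internal data."""
    supplier.execute("DELETE FROM iface_cosmid")
    supplier.execute("""
        INSERT INTO iface_cosmid (cosmid_id, contig_id)
        SELECT cosmid_id, contig_id
        FROM internal_cosmid
        WHERE contig_id IS NOT NULL
    """)
    supplier.commit()


def pull(supplier, receiver):
    """Receiver-initiated transfer: read interface rows, load a local table."""
    rows = supplier.execute(
        "SELECT cosmid_id, contig_id FROM iface_cosmid ORDER BY cosmid_id"
    ).fetchall()
    receiver.execute(
        "CREATE TABLE IF NOT EXISTS public_cosmid "
        "(cosmid_id TEXT PRIMARY KEY, contig_id TEXT)"
    )
    receiver.executemany(
        "INSERT OR REPLACE INTO public_cosmid VALUES (?, ?)", rows
    )
    receiver.commit()
    return len(rows)
```

In this arrangement neither side changes its internal tables to suit the other: the supplier maps its data into the interface tables, and the receiver maps from them into its own schema.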
GDBI consists of some 11,750 different data items arranged in 15 tables (presently aggregating to 3 megabytes). These data comprise some 26 contigs, 320 cosmids, and a total of 815 attribute facts about these cosmids.
The GDBI database will contain information on all contigs for which some useful biological fact has been ascertained. Biological facts about the contigs include finding positive hybridization to member cosmids of a unique sequence element (e.g., cDNA probes), in situ hybridization mapping results, and repetitive element content. Contigs are described according to their cosmid membership and the order of a minimal covering subset of elements.
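As a rough illustration of this description, a contig record might carry its full cosmid membership, an ordered minimal covering subset, and a list of biological facts. The class name, field names, and fact format below are hypothetical and are not taken from the GDBI table definitions; the sketch only checks that the ordered subset is drawn from the contig's own members.

```python
from dataclasses import dataclass, field


@dataclass
class Contig:
    contig_id: str
    members: set[str]            # every cosmid assigned to the contig
    covering_order: list[str]    # ordered minimal covering subset of members
    facts: list[tuple[str, str]] = field(default_factory=list)  # (kind, value)

    def validate(self):
        """Check that the ordered subset has no repeats and uses only members."""
        if len(set(self.covering_order)) != len(self.covering_order):
            raise ValueError("covering order repeats a cosmid")
        outsiders = [c for c in self.covering_order if c not in self.members]
        if outsiders:
            raise ValueError(f"not members of {self.contig_id}: {outsiders}")
        return True


# Made-up identifiers, for illustration only.
example = Contig(
    contig_id="ctgA",
    members={"c1", "c2", "c3", "c4"},
    covering_order=["c1", "c3", "c4"],
    facts=[("cDNA probe hybridization", "positive on c3")],
)
example.validate()
```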
Although this approach nominally requires that the supplier laboratory use a relational client-server DBMS and provide direct Internet access, the server need not come from the same vendor as GDB's. For example, the Sybase "Open-Server" software interface allows remote RDBMS servers supplied by other vendors to be accessed directly through the standard relational database query language (SQL). With significantly more effort, the Open-Server package can be used to access supplier DBMSs that are neither client-server nor relational in design, as long as they afford Internet connection and an appropriate application program interface.
At tolerable cost, the Internet connection requirement can be relaxed: telephone lines can be made to appear to the computers at both ends as reasonably high speed "Internet" links. A suitable interface that is programmable to a supplier database and a willingness to cooperate are the only real requirements for putting such systems in place. To provide reasonable functionality, the supplier database should be maintained on a workstation-class, multitasking computer; however, this need not be excessively expensive.
Investigators interested in experimenting with this system at their sites should contact
Elbert Branscomb and Tom Slezak (LLNL) and Robert Robbins (GDB) contributed to this article. Many more people, especially those at GDB, were critically involved in the project itself. Richard Lucier (now at the University of California, San Francisco) and Peter Pearson (GDB) played essential roles in responding favorably and aggressively to original suggestions that this data transfer experiment be undertaken; many GDB technical staff members worked very hard to make the project successful.
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v3n5).
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome.
Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.