Sponsored by the U.S. Department of Energy Human Genome Program
Human Genome News Archive Edition
Human Genome News, April-June 1996; 7(6)
Santa Fe 96
Redesigning GDB and GSDB
The explosive growth of information and the challenges of acquiring, representing, and providing access to data pose new and monumental tasks for the large public databases. Ken Fasman [Genome Database (GDB)] and Gifford Keen [Genome Sequence Data Base (GSDB)] discussed the restructuring of GDB and GSDB to handle the flood of data and make it useful for downstream biology.
Observing that one can't scroll or BLAST through 3 billion base pairs in a meaningful way, Fasman defined GDB's future role as the coordination site for the complete electronic description of the human genome. The map, he asserted, provides an ideal framework for jumping into the sequence (http://www.gdb.org/).
Fasman described the extensive changes made to GDB over the last 2 years that have culminated in the enhanced representation of genomic maps and gene information in GDB V6.0, which was released early this year [HGN 7(3-4), 13-14 and 7(5), 15].
Redesign of the database schema and front-end interfaces now provides true graphical genetic and physical map representation; direct community editing and curation, including third-party annotation; and an improved model for gene information that includes links to databases describing function, structure, products, expression, and associated phenotypes. A user can create a link from any GDB object to any other entity on the Internet. GDB plans to become the focal point for accessing information about the human genome.
Under the Hood
New technologies used in developing V6.0 include an object-oriented data model, object broker, data-driven WWW interface, and graphical interfaces for the most popular computer platforms. The new GDB architecture depends heavily on the Object-Protocol Model (OPM) developed by Victor Markowitz and colleagues at LBNL (see "GDB-LBNL"). GDB V6.0 data representation is captured in a schema file that drives all other pieces of software. This new architecture will enable GDB to adapt more quickly to changes in biological knowledge and representation of maps, genes, and other structures.
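To illustrate what a schema that "drives all other pieces of software" can look like in practice, here is a minimal Python sketch. The schema layout, class names, and field names are hypothetical placeholders and are not taken from GDB or OPM; the point is only that validation and display code read the schema instead of hard-coding each object type.

```python
# Hypothetical schema: object classes and their typed attributes.
# None of these names come from GDB; they are placeholders.
SCHEMA = {
    "Gene": {"symbol": str, "chromosome": str},
    "Marker": {"name": str, "position_cm": float},
}


def validate(class_name, record):
    """Check a record against the schema; return a list of problems."""
    fields = SCHEMA[class_name]
    problems = []
    for name, expected in fields.items():
        if name not in record:
            problems.append(f"missing {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name in record:
        if name not in fields:
            problems.append(f"unknown field {name}")
    return problems


def render(class_name, record):
    """Produce a one-line display string, in schema field order."""
    parts = [f"{name}={record[name]}" for name in SCHEMA[class_name]]
    return f"{class_name}({', '.join(parts)})"
```

In a design like this, supporting a new kind of object means adding one entry to SCHEMA; validate and render need no changes.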
At the heart of the system is a Sybase database server that communicates in SQL, the relational query language. Everything from that point forward deals in complex objects, rather than in the rows and tables of a relational database.
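The step from relational rows to complex objects can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for the Sybase server, and the table names, column names, and the Gene class are assumptions made for the example, not GDB's actual schema or object broker.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Gene:
    """A complex object assembled from several relational rows."""
    symbol: str
    links: list = field(default_factory=list)  # external URLs


def load_genes(conn):
    """Join two relational tables and fold the rows into Gene objects."""
    rows = conn.execute(
        "SELECT g.symbol, l.url FROM gene g "
        "LEFT JOIN link l ON l.gene_id = g.id ORDER BY g.id, l.id"
    )
    genes = {}
    for symbol, url in rows:
        gene = genes.setdefault(symbol, Gene(symbol))
        if url is not None:  # genes without links yield a NULL url
            gene.links.append(url)
    return list(genes.values())


def demo_connection():
    """Build a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT);
        CREATE TABLE link (id INTEGER PRIMARY KEY, gene_id INTEGER, url TEXT);
        INSERT INTO gene VALUES (1, 'GENE1'), (2, 'GENE2');
        INSERT INTO link VALUES (1, 1, 'http://example.org/a'),
                                (2, 1, 'http://example.org/b');
        """
    )
    return conn
```

Calling load_genes(demo_connection()) returns one object per gene with its links attached, so code above this layer never handles individual rows.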
Future enhancements will include improved map editing, an integrated editing environment, improved polymorphism and mutation representation, and integration with the specialized GSDB Sequence Annotator and Mouse Genome Database interfaces. To tie GDB to the evolving sequence databases, an interface is being developed to represent gene structure maps (maps of introns, exons, and regulatory regions associated with genes).
Keen identified data acquisition, representation, and access as major issues for sequence databases.
Capturing and Annotating Data
Data acquisition is a two-part challenge, he said. Vast quantities of sequence data will be captured with custom software for bulk-submission processes; future plans include database-to-database communication so that data can be downloaded directly from laboratories into GSDB. The more difficult task in data acquisition, he noted, is capturing the follow-on sequence annotation, which is usually published in print journals and subsequently "lost." This data will be crucial for studying gene expression, variation, and function. GSDB Annotator, a graphical browser and editor, is being developed to facilitate community annotation of the database. Researchers are also working to provide access to such common analysis algorithms as BLAST and GRAIL.
Data Representation: Building Whole Chromosomes
In addition to captured sequences and annotations, information needs to be generated about relationships between sequences. The data must be maintained in a form capable of supporting complex, ad hoc queries. GSDB is working toward a near-term model of 24 representative sequences for humans, one for each chromosome. As data comes in, it will be aligned to the representative sequence, which initially will have many gaps. Keen likened GSDB to a community laboratory information-management system supporting what is essentially a multiyear, multilaboratory, multiorganism shotgun-assembly process. Feature accession numbers will enable separation of annotation from sequences.
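The idea of a representative sequence that starts out mostly as gaps, with annotation kept apart under its own accession number, can be sketched in Python. This is a toy under stated assumptions: fragment positions are taken as already known (real alignment requires similarity searching), and the gap symbol, function names, and accession format are invented for the example rather than drawn from GSDB.

```python
GAP = "N"  # placeholder symbol for positions not yet sequenced


def place_fragments(length, fragments):
    """Build a representative sequence of the given length.

    fragments is a list of (start, bases) pairs with 0-based starts.
    Positions not covered by any fragment remain gaps.
    """
    seq = [GAP] * length
    for start, bases in fragments:
        if start < 0 or start + len(bases) > length:
            raise ValueError("fragment falls outside the sequence")
        for offset, base in enumerate(bases):
            current = seq[start + offset]
            # Overlapping fragments must agree on every shared base.
            if current != GAP and current != base:
                raise ValueError(f"conflict at position {start + offset}")
            seq[start + offset] = base
    return "".join(seq)


def gap_fraction(sequence):
    """Fraction of the representative sequence that is still unknown."""
    return sequence.count(GAP) / len(sequence)


# Annotation is stored apart from the sequence, keyed by its own
# (hypothetical) accession number, with half-open coordinates.
FEATURES = {
    "FEAT0001": {"kind": "exon", "start": 2, "end": 6},
}


def feature_bases(sequence, accession):
    """Look up a feature by accession and slice it from the sequence."""
    feat = FEATURES[accession]
    return sequence[feat["start"]:feat["end"]]
```

Because features refer to the sequence only by coordinates, a feature can be added or revised without touching the sequence record, and new fragments can fill gaps without rewriting the annotation.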
Although GSDB has the tools and the structure (normalized and atomized data) to answer complex queries, such as those about relationships among annotations, problems with data quality and consistency currently prevent it from answering them well. GSDB is now mounting a major effort to develop software for rationalizing the data stream as it enters the database.
GSDB has also developed an object-oriented access library that sits on top of the database. Almost all GSDB applications and the software that imports data from other databases work through this object layer. GSDB will make the object libraries and an application programming interface available to the public. Programmatic access will be through assigned accounts, and the database can be accessed either through the object libraries or directly on the table, row, and column level.
The new GSDB schema is complete and should be operational later this year. After fairly extensive alpha and beta testing, GSDB Annotator should be released at the same time on Mac and Sun, with Windows to follow. Software will be available via ftp from NCGR's Web site (http://www.ncgr.org/).
2013 post-production note: GDB is no longer operational. See http://www.ncbi.nlm.nih.gov/projects/genome/guide/human/index.shtml
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v7n6).
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome.
Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.