Sponsored by the U.S. Department of Energy Human Genome Program
Sequence-database challenges and features of the new GSDB schema were among topics addressed by Michael Cinkosky (NCGR), who observed that the GSDB staff's responsibility is to help the community keep the database complete, accurate, and up to date. (See New WWW Tools.) The new system will support direct client-server inserts and updates; entry versioning, which retains all versions of public entries; and third-party annotation, in which the core entry belongs to the original author. Entries are recast so that almost all data can point to links in other databases. Efforts are directed toward achieving "anonymous interoperability," with GSDB as one component in a biology-wide database federation.
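As an illustration only, the versioning and third-party annotation ideas described above might be modeled as follows. This is a minimal sketch, not the GSDB schema: the `Entry` class, its field names, and the accession and user names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical sequence entry; not the actual GSDB schema."""
    accession: str
    owner: str                                        # original author owns the core entry
    versions: list = field(default_factory=list)      # every version is retained
    annotations: list = field(default_factory=list)   # third-party notes

    def update(self, user, sequence):
        """Append a new version; only the owner may change the core entry."""
        if user != self.owner:
            raise PermissionError("only the original author may update the core entry")
        self.versions.append(sequence)
        return len(self.versions)  # 1-based version number

    def annotate(self, user, note):
        """Attach a note from any user to the current version without altering it."""
        self.annotations.append((user, len(self.versions), note))

    def current(self):
        """Return the most recent version of the sequence."""
        return self.versions[-1]


entry = Entry("X00001", "alice")
entry.update("alice", "ACGT")
entry.update("alice", "ACGTT")
entry.annotate("bob", "possible exon")
```

Older versions remain in `entry.versions`, and the note from the second user is stored alongside the entry rather than inside it.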
GSDB staff is designing a sequence editor as an interactive tool that sequencing laboratories can use to view and edit large regions with complex annotation; design goals include intuitive, online graphical editing. The editor will be freely distributable and designed for integration with other software. Anyone interested in participating in the design process can obtain the prototypes that serve as a basis for discussion (http://www.ncgr.org).
At the National Center for Genome Resources open house, Gifford Keen shows Margaret Jefferson (California State University, Los Angeles) the mainframe computer that manipulates all data queries coming into the Genome Sequence Data Base.
Chris Fields (NCGR) outlined the emerging informatics challenges guiding NCGR's strategic planning and offered some ideas on a productive new direction for genome informatics. A key technical challenge is the diversity of data applications, which will require the connection of genome data with information generated from various disciplines and maintained in different databases. Fields observed that the community will be forced to reduce data-maintenance costs by moving from centralized data banks and databases to interoperable data resources joined by the Internet.
An even greater challenge, Fields continued, will be the need for precise description of the same biological data at successive levels of complexity. He said computer science has developed conceptual tools, including the Virtual Machine, for describing the precise context in which data occurs. The bioinformatics community should consider using these tools to interrelate such data.
Automating sequence-data annotation is becoming increasingly important as volume overwhelms manual annotation. David States (WU) is developing improved methods for analyzing nucleic acid sequences based on sequence similarity and very large scale classification techniques. The WU group has developed an improved user interface that uses the WWW, Perl, and HTML to link classification data with other databases. Goals include identifying unannotated reading frames in nucleic acid databases.
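To make the notion of a reading frame concrete, the sketch below scans the three forward frames of a DNA string for stretches that begin with ATG and end at a stop codon. It is a textbook-style illustration, not the WU group's similarity-based method; the function name `find_orfs` and the `min_codons` threshold are assumptions made for this example, and the reverse strand is not examined.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) index pairs of open reading frames on the forward strand.

    An ORF here runs from an ATG through the first in-frame stop codon.
    `end` is exclusive, so seq[start:end] includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


example = "CCATGAAATTTTAGGG"
print(find_orfs(example))
```

A frame found this way that has no corresponding feature in the database record would be a candidate unannotated reading frame.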
Fragment assembly of shotgun-sequencing data involves deciding which overlaps or melds to use in reconstructing the original strand. Eugene Myers (University of Arizona) described an algorithm that greatly simplifies the problem by first identifying all overlaps and melds that must occur in an optimal solution. For highly repetitive target DNA sequences, he further proposed use of a maximum-likelihood estimator based on the two-sided Kolmogorov-Smirnov statistic.
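For readers unfamiliar with the statistic mentioned above, the following sketch computes a two-sided Kolmogorov-Smirnov statistic for a set of fragment start positions against a uniform distribution over the target. It illustrates the statistic only and is not Myers' estimator; the function name `ks_uniform` and the assumption that start positions should be uniform over `[0, target_length]` are choices made for this example.

```python
def ks_uniform(starts, target_length):
    """Two-sided KS statistic of start positions versus Uniform(0, target_length).

    Returns the largest absolute gap between the empirical distribution of
    the scaled positions and the uniform distribution function F(x) = x.
    """
    xs = sorted(s / target_length for s in starts)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d_plus = (i + 1) / n - x   # empirical CDF just after x, minus F(x)
        d_minus = x - i / n        # F(x) minus empirical CDF just before x
        d = max(d, d_plus, d_minus)
    return d


evenly_spread = [125, 375, 625, 875]
piled_up = [10, 20, 30, 40]
print(ks_uniform(evenly_spread, 1000))
print(ks_uniform(piled_up, 1000))
```

A small value means the starts look evenly spread; a value near 1 means they pile up in one place, as can happen when repeated regions are collapsed onto one another in a candidate layout.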
Edward Uberbacher (ORNL) described the latest releases of GRAIL (version 1.2) and genQuest (version 1.1). These analytical systems take sequences through a series of recognizers that pick out such features as exons, promoters, and genes; search seven databases for similarities; and incorporate the information in an annotated report. Exon-recognition rate is 94% for sequences that are correct in the database. About 20 million bases/month are processed from 4700 transactions, including those handled by the e-mail versions of GRAIL and genQuest and by XGRAIL and XgenQuest. In collaboration with Fasman and NCGR, Uberbacher's group is working on faster sequence searches and parallel queries of some relational database systems.
Uberbacher presented the new GRAIL 1A, which is designed to process large files of cDNAs, ESTs, or genomic fragments to predict coding regions, search databases, and produce a summary report. GRAIL-ET, a new technology that will be useful for analyzing very low pass sequence, detects errors in coding sequences using a coding-recognition and dynamic-programming method. With a 1% indel error rate, the system found 94% of the exons (89% of the gene message after the model was made). GRAIL and genQuest can be accessed by e-mail server (GRAIL@ornl.gov and Q@ornl.gov, respectively), through graphical client tools obtained by ftp at arthur.epm.ornl.gov, or via Mosaic (http://compbio.ornl.gov/Grail-1.3/). GRAIL has been licensed to ApoCom, Inc., for use by researchers in proprietary pharmaceutical and biotechnology companies who cannot use the Internet because of data-security concerns. Uberbacher's group is also supporting mouse-human mapping research at ORNL and has constructed an ACEDB implementation containing mapping and phenotype data.
Participants look forward to the next DOE Contractor-Grantee Workshop, scheduled for January 28-February 1, 1996.
Denise Casey, HGMIS
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v6n5).