Sponsored by the U.S. Department of Energy Human Genome Program
Human Genome News Archive Edition
Human Genome News, September 1993; 5(3)
DOE-Supported Meeting Stresses Software, Databases for Map and Sequence Data
The DOE Office of Health and Environmental Research (OHER) convened a meeting of informatics experts in Baltimore on April 26-27 to assess current OHER bioinformatics efforts and obtain advice on essential planning and coordination of future activities. The meeting was a continuation of a long-term planning strategy initiated in 1992 with an OHER review of the DOE portfolio of independent genomic informatics projects and core activities at genome centers [HGN 5(1), 3-4 (May 1993)]. OHER supports research in genome informatics, structural biology, and other programs requiring integrated applications of biological data. Meeting topics included the biological research community's informatics needs, especially for databases, and actions to ensure that these needs will be met. Although this meeting grew out of an ongoing review of DOE-supported activities, the report of the meeting is being circulated to the general community. The National Center for Human Genome Research (NCHGR) and OHER expect to use the report and the community comments to develop plans for maintaining their coordinated support of research and development of genome informatics tools and systems.
A general discussion focused on the role of community databases in facilitating OHER-supported research. More than mere archives, genomic databases provide analysis tools for project bench work. Reviewers concluded that existing community databases fall short of meeting community needs. The problems stem both from rapidly changing requirements and from conceptual and technical idiosyncrasies in the design of current systems.
Investigators require access to databases containing protein sequences, crystallographic structures, nucleotide sequences, and mapping data (genes, maps, and probes). Databases managing this information include
Wide-ranging and vigorous discussion was held on requirements for DNA sequence and mapping databases. Participants felt strongly that interoperability is critical for conducting bioinformatics research and ensuring that information in biological databases is optimally useful to other research areas.
Coordination among genome databases and other informatics systems must be the highest priority. Software and database projects must interact with and consider other efforts, or the cost of systems development will be too high. In addition, projects that do not successfully interact with other databases will lead to inefficiencies associated with unlinked data. Users must be able to retrieve related data from multiple databases such as GDB, PIR, Medline, PDB, and GenBank without having to query each database separately and integrate the results themselves.
In addition to database interoperability within a specific research domain (e.g., the Human Genome Project), many workers also need integrated access to a variety of biological data areas. For example, studies of gene products and their functions may be outside genome project scope, but the value of genome data would be maximized by linking project results to databases dedicated to understanding gene products and their functions.
Community databases should be designed generically for interoperability, requiring both semantic and technical consistency among projects. For minimum semantic linkage, biological objects in all interoperating databases must be identified and cross-referenced via accession numbers.
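As a minimal sketch of the accession-number linkage described above, the Python below builds a cross-reference index over two in-memory record sets. The database names, record layout, and accession values are hypothetical and chosen only for illustration; they do not reflect the schema of any real community database.

```python
from collections import defaultdict

# Hypothetical records: each database identifies its objects by accession
# number and lists the accessions of related objects held elsewhere.
sequence_db = {
    "SEQ0001": {"description": "example nucleotide sequence", "xrefs": ["MAP0042"]},
    "SEQ0002": {"description": "another example sequence", "xrefs": []},
}
mapping_db = {
    "MAP0042": {"description": "example map locus", "xrefs": ["SEQ0001"]},
}


def build_xref_index(databases):
    """Return {cited accession: set of (database name, citing accession)}."""
    index = defaultdict(set)
    for db_name, records in databases.items():
        for accession, record in records.items():
            for target in record["xrefs"]:
                index[target].add((db_name, accession))
    return index


databases = {"sequence_db": sequence_db, "mapping_db": mapping_db}
xref_index = build_xref_index(databases)

# Which records, in any database, point at map locus MAP0042?
linked_to_map = sorted(xref_index["MAP0042"])
```

Because every object carries a stable accession number, the index can be built without either database knowing anything about the other's internal structure.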
To achieve minimum technical linkage, participating systems must at least present similar application programming interfaces through the Internet. Currently, this is achieved most cost-effectively when all interoperating databases are implemented as relational databases that support Structured Query Language (SQL) queries over the Internet. Other architectures may eventually be optimal, but for now all databases need to support current community standards for data access (e.g., remote data access or SQL). Also, participating databases should be (1) self-documenting (offering an online data dictionary and other documentation), (2) stable (undergoing schema change rarely and only after ample warning), and (3) consistent (based on federation-wide semantics). However, goals of stability and consistency conflict with the need for maximum responsiveness to changing community needs.
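The sketch below illustrates, under stated assumptions, what SQL access and a self-documenting schema might look like. An in-memory SQLite database stands in for a remote relational system, and the table names, column names, and values are hypothetical. The last query reads SQLite's own schema catalog (`sqlite_master`) as a simple stand-in for an online data dictionary.

```python
import sqlite3

# In-memory SQLite stands in for a remote relational database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE locus (
        accession  TEXT PRIMARY KEY,
        symbol     TEXT,
        chromosome TEXT
    );
    CREATE TABLE sequence (
        accession       TEXT PRIMARY KEY,
        locus_accession TEXT,      -- cross-reference to locus.accession
        length_bp       INTEGER
    );
    INSERT INTO locus VALUES ('MAP0042', 'EXAMPLE1', '7');
    INSERT INTO sequence VALUES ('SEQ0001', 'MAP0042', 1200);
    INSERT INTO sequence VALUES ('SEQ0002', NULL, 800);
    """
)

# A standard SQL join that follows the accession-number cross-reference.
rows = conn.execute(
    "SELECT s.accession, l.symbol, l.chromosome, s.length_bp "
    "FROM sequence AS s JOIN locus AS l ON s.locus_accession = l.accession "
    "ORDER BY s.accession"
).fetchall()

# Self-documentation: the schema catalog maps each table to its definition.
data_dictionary = dict(
    conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'").fetchall()
)
```

Any client that speaks SQL could issue the same join without custom code for each database, which is the cost argument made above.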
Local incentives now often work against interoperability. Usually, community genome-relevant databases are not funded by the same agencies, advised by the same advisors, or otherwise coordinated. Community database funding is always limited, and interoperability is seldom a top priority; federal agencies and others will need to provide incentives to advance it.
A truly federated information infrastructure cannot meet community needs without a minimum level of semantic consistency among data submitters and databases. In addition, an infrastructure will be needed to permit a client to make automatic queries to multiple databases at different locations. These databases should appear to an end user as one integrated database.
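One way to picture a client that queries several databases and presents a single integrated result is sketched below. Two in-memory SQLite databases stand in for databases at different locations. The source names, the `entry` table, and the `federated_lookup` function are hypothetical and are not drawn from any system discussed at the meeting.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory stand-in for one remote database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (accession TEXT, field TEXT, value TEXT)")
    conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", rows)
    return conn


# Hypothetical sources that share only the accession-number convention.
sources = {
    "map_db": make_source([("ACC0001", "chromosome", "7")]),
    "sequence_db": make_source(
        [("ACC0001", "length_bp", "1200"), ("ACC0002", "length_bp", "800")]
    ),
}


def federated_lookup(accession, sources):
    """Send the same query to every source and merge the answers into one view."""
    merged = {}
    for name, conn in sorted(sources.items()):
        cursor = conn.execute(
            "SELECT field, value FROM entry WHERE accession = ? ORDER BY field",
            (accession,),
        )
        for field, value in cursor:
            # Prefix each field with its source so provenance is preserved.
            merged[f"{name}.{field}"] = value
    return merged


result = federated_lookup("ACC0001", sources)
```

The user asks one question and receives one answer. The fan-out to individual databases happens inside the client, which only works if the sources agree on identifiers, as the semantic-consistency requirement above implies.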
Database User Requirements
Users and producers of genome information generally fall into several institutional categories with different capabilities and needs. They include (1) genome centers; (2) independent laboratories at major research institutions; (3) individual investigators and small laboratories; and (4) other users, such as those in small clinical settings, homes, and high schools.
In discussing user requirements, the group agreed that Internet access will become an essential highway for routine, up-to-date data submission, retrieval, and analysis. Investigators and administrators should communicate with those in other scientific disciplines to increase awareness of the Internet's importance in data communication and distribution. Genome centers and research institutions have different local requirements, so the need for local software tools and database designs will also vary; therefore, robust support will be required for onsite acquisition and handling and automated submission of mapping and sequencing data among sites.
Local user-support systems and academic or commercial software should be developed for interaction with community databases, and dispersed user communities will need easy-to-use data-access tools. Meeting the needs of various user classes will require widely accepted application programming interfaces to databases and associated software.
Users will need databases and software analysis tools capable of reviewing millions of base pairs of new sequence each month. Community software tools should at minimum allow machine-readable input and output. Analysis tools with differing data formats create unnecessary difficulties for automated analysis.
Access to individual databases will require analysis algorithms, interfaces, and other tools incorporated into easy-to-use suites of software and servers that can run across the network. Users will also need integrated views of data from multiple databases and interfaces as well as database-linking software that is scalable to the large amount of biological data to be incorporated.
Data Submission and Curation
Even if great success attends bench research in the Human Genome Project, the project will not be successful unless all researchers can retrieve answers to their genome-related questions. Systems for improved data submission and curation will enable investigators to submit data to robust community databases promptly and easily and receive useful and trustworthy data.
Genome researchers, funding agencies, and the entire biological community should ensure that an infrastructure for capturing data is in place. Discussants expressed skepticism that data capture by journal scanning will be adequate for total data submission. Many journals require database submission of sequence entries and protein structural coordinates before publication, but subsequent information about the sequence or protein is not always added. Discussants suggested that journals may also need to require submission of other types of information and to reference all relevant accession numbers if the article adds annotation information.
Database curation, and probably community-based curation, will become increasingly important. A possibility raised was a new professional job category, similar to museum curators, to maintain the databases. Routine capture of data from nongenome laboratories will also be needed, and users from these laboratories should be able to add routinely to the information.
Coordination of Informatics Efforts
Attendees noted that previous Human Genome Project 5-year goals focused more on the community than on the core-support local-user domain. To achieve mapping and sequencing goals, core informatics support will be needed at genome centers and for center-to-center dispersed collaborations and robust connections.
Participants stated that more information should flow between users and programmers at different centers. Computer demonstrations at the 1993 DOE genome contractor-grantee workshop in Santa Fe, New Mexico, were praised as a step in the right direction because they allowed hands-on informatics interaction between biologists and computer experts. Other suggestions included a sequence-analysis software fair in addition to the data fair at the Hilton Head genome sequencing meeting; joint NIH-DOE genome project-wide meetings similar to the DOE Santa Fe meetings; NIH-DOE informatics workshops with experts from genome centers and major databases; and more visiting and interchange among centers, perhaps through short-term exchange of personnel.
The need for coordinating the DOE genome program with NCHGR efforts was emphasized, as was cooperation with other NIH institutes, National Science Foundation, U.S. Department of Agriculture, and foreign efforts. Attendees felt that the rate of new-software implementation should be increased and that strategies should be considered for funding more software-development research, maintaining resource databases, supporting servers at various sites, and integrating diverse software into common servers or sets of tools.
Technology Transfer, Software Sharing
Attendees noted varying degrees of transfer and sharing, from exchanging experiences in developing software and databases to sharing source codes and schemata.
Genome center informatics core support activities such as databases and software tools cannot be transferred easily among centers because of deeply embedded differences in experimental methods. Nonetheless, off-site and some on-site projects are explicitly funded to provide research results and resources to the wider community. Participants felt that informatics tools should be made more widely available through software libraries and resource listings. They also noted the need to turn community-based research into easily accessible tools.
The entire genome community and individual centers will at times need to compare results and approaches through the exchange of data and analyses performed with common standard analysis tools. A mechanism was proposed by which one or more servers could house a suite of analysis tools incorporating research results and software design of several projects.
Training programs, particularly institutional training grants that permit sites to develop courses and support students, are necessary to produce multidisciplinary people who support informatics. Attendees felt that individual fellowships should be maintained or increased. The possibility of short-term travel fellowships to allow for greater exchange of ideas and results was discussed.
[Jay Snoddy and Robert Robbins, DOE OHER, Anne Adamson, HGMIS]
OHER is releasing a report of this meeting as a part of its ongoing planning process. With input from the genome community, the agency hopes to improve and expand the report into a white paper. The report is available from HGMIS, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831-6050, telephone 865-574-0597, and through the Johns Hopkins University Gopher at gopher.gdb.org under the Mathematics and Biology heading.
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v5n3).
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome.
Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.