Sponsored by the U.S. Department of Energy Human Genome Program
"Microbial Genome Research and Its Applications" was the topic of the 35th Hanford Symposium on Health and the Environment held October 21-24, 1996, in Richland, Washington.1 According to participants, the genomes of as many as 100 microorganisms are expected to be completely or partially sequenced by the turn of the century.2 The overriding question during the meeting was, How do we deal with, interpret, and use all this new information?
The meeting served well as a status report on whole-genome microbial sequencing and on initial attempts to use the information. The complete 3.6-Mb sequence for the cyanobacterium Synechocystis sp. PCC6803 was announced by Hirokazu Kotani (Kazusa DNA Institute, Japan).3 Progress was reported for Treponema pallidum, the causative agent of syphilis (George Weinstock, University of Texas Medical School); the hyperthermophilic Archaeon Pyrobaculum aerophilum (Jeffrey Miller, University of California at Los Angeles); Pyrococcus furiosus, a thermophilic Archaeon (Robert Weiss, University of Utah); and the hyperthermophilic eubacterium Aquifex VF5 (Ron Swanson, Recombinant BioCatalysis, Inc.).
Presentations and discussions also focused on techniques for generating clones for sequencing, different ways of identifying genes and gene functions, and databases.
Identification of Gene Function
The explosion of information from the genome projects has spawned a number of efforts, some of them coordinated, to develop readily accessible databases that are truly useful to individual scientific investigators. The session chaired by Ross Overbeek (Argonne National Laboratory) covered some of these attempts and their underlying problems. A major challenge is to annotate completed genomes at a rate comparable to the generation of raw sequence data.
Owen White [The Institute for Genomic Research (TIGR)], describing specialized in-house software to help identify frame-shift errors, pointed out that tracking problems can result in lost information when data is moved between projects or among institutions. He described how, for some metabolic pathways, genes for key enzymatic proteins cannot be found. Is this because the microbe lacks the enzyme, or is it due to the inability to identify corresponding open reading frames (ORFs)?
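To illustrate why a single frame-shift error can make a gene "disappear," the following minimal sketch scans the three forward reading frames of a DNA string for ATG-to-stop ORFs. It is not the TIGR in-house software, whose details are not given here; the function name `find_orfs`, the `min_codons` cutoff, and the toy sequences are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    Hypothetical helper: real gene finders also scan the reverse strand and
    use statistical models rather than a simple length cutoff.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # count codons from ATG up to (not including) the stop codon
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence (made up): ATG AAA TTT GGG TAA in frame 2
seq = "CCATGAAATTTGGGTAACC"
# Same sequence with one spurious base inserted, mimicking a frame-shift error
shifted = seq[:8] + "G" + seq[8:]

print(find_orfs(seq))      # [(2, 17, 2)]
print(find_orfs(shifted))  # []
```

In this toy case the single inserted base pushes the stop codon out of frame, so the scanner reports no ORF at all, which is one way a real enzyme gene could go undetected.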
Overbeek pointed out other problems in relying on sequence homology for gene identification. He cited one case in which enzymes shared sequence homology but belonged to unrelated metabolic pathways. To help identify these problems, Overbeek and Niels Larsen (Michigan State University) are linking these gene lists to known metabolic pathways on a Web site called WIT,4 which replaces the former PUMA site.
Others are using different computer-assisted approaches to identify ORFs of unknown function. Monica Riley (Woods Hole Oceanographic Institute) suggested that genes encoding enzymes need to be broken into functional domains before they are assigned to databases. Enzymes can exhibit multiple functions by encoding multiple functional domains in a single protein or by complexing multiple protein subunits of different function. Domain order may be shuffled and still result in a similarly functioning enzyme that would go unrecognized if entire gene sequences were compared. Using Riley’s domain approach could suggest additional functions for enzymes having only one known function (see p. 6 for more details on Riley’s approach).
Keynote speaker J. Craig Venter (TIGR) was one of a number of participants who expressed concern about the consequences of poor control over the quality of data entering public databases. Another issue was how much time should pass between sequence determination and deposition in a public database for government-sponsored projects. Some attendees supported immediate deposit of new information to create and maintain a level playing field and to minimize the commercial advantage of firms paid by federal sponsors to acquire the sequence. Others felt that a greater risk of archiving bad data would accompany its deposition without some measure of preliminary quality assurance.
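The point about shuffled domain order can be sketched in a few lines. The example below assumes each protein has already been reduced to an ordered list of domain labels; how those domains are delineated is not described here. The domain names and both helper functions are hypothetical and are not part of Riley's method.

```python
from collections import Counter

def same_domain_content(domains_a, domains_b):
    """True if two proteins contain the same domains, regardless of order."""
    return Counter(domains_a) == Counter(domains_b)

def shared_domains(domains_a, domains_b):
    """Sorted list of domain labels present in both proteins."""
    return sorted(set(domains_a) & set(domains_b))

# Hypothetical domain annotations, listed from N-terminus to C-terminus
enzyme_1 = ["kinase", "binding", "transferase"]
enzyme_2 = ["transferase", "kinase", "binding"]  # same domains, shuffled order
enzyme_3 = ["binding", "hydrolase"]

print(enzyme_1 == enzyme_2)                     # False: order-sensitive comparison misses the match
print(same_domain_content(enzyme_1, enzyme_2))  # True: domain-level comparison finds it
print(shared_domains(enzyme_1, enzyme_3))       # ['binding']
```

The last call hints at how a domain-level view might suggest an additional function for an enzyme with only one known function: a shared domain links it to a protein annotated differently.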
Innovative work in the laboratory clearly is needed to test hypotheses suggested by in silico methods and to search for functions associated with unidentified ORFs. A novel approach for identification was presented by George Church (Harvard Medical School), who described the development of Genomically Engineered Multiplex Selection (called GEMS) to measure simultaneously the survival effects of in-frame deletions in many genes in many environments. To determine whether specific genes are required under a given set of conditions, each gene is deleted in frame, mutants are pooled, and mutant loss is monitored during growth under various culture environments.
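The pooled-selection logic described above can be sketched as a simple abundance comparison. The example assumes that each deletion mutant's abundance in the pool can be counted before and after growth; the measurement method is not specified here. The function name, the pseudocount, the log2 threshold, and all counts are illustrative assumptions, not details of GEMS itself.

```python
import math

def depleted_mutants(before, after, threshold=-1.0, pseudocount=1.0):
    """Return {gene: log2 ratio} for mutants whose pool fraction dropped.

    before/after map gene name -> abundance count of its deletion mutant.
    A strongly negative log2 ratio suggests the deleted gene is needed
    under the tested condition. Threshold and pseudocount are arbitrary.
    """
    n = len(before)
    total_before = sum(before.values()) + pseudocount * n
    total_after = sum(after.get(g, 0) for g in before) + pseudocount * n
    hits = {}
    for gene, count_before in before.items():
        frac_before = (count_before + pseudocount) / total_before
        frac_after = (after.get(gene, 0) + pseudocount) / total_after
        log2_ratio = math.log2(frac_after / frac_before)
        if log2_ratio <= threshold:
            hits[gene] = log2_ratio
    return hits

# Made-up counts for three hypothetical deletion mutants in one environment
before = {"geneA": 1000, "geneB": 1000, "geneC": 1000}
after = {"geneA": 1400, "geneB": 1500, "geneC": 100}

hits = depleted_mutants(before, after)
print(sorted(hits))  # ['geneC']
```

Running the same comparison across several culture environments would give a gene-by-condition table of which deletions are tolerated and which are not.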
Some functionally equivalent enzymes have evolved independently and thus are coded for by genes with unrelated sequences. Jay Short (Recombinant BioCatalysis) described a high-throughput screening strategy in which total genomic DNA is extracted from the environment, cloned, and tested for expression of enzymatic functions of interest. The result is a pool of gene sequences (an “environmental library”) for proteins capable of carrying out a single function. The collective genomes are archived in recombinant form using cloning vectors containing the Escherichia coli F-factor origin of replication, which permits high-fidelity replication of 40- to 300-kb cloned DNA fragments. As an example of this innovative concept’s potential, Short reported the discovery of previously unknown Archaeal genes for RNA helicase, which is responsible for ATP-dependent alteration of RNA secondary structure; and glutamate semialdehyde aminotransferase, which is involved in initial steps of heme synthesis.
Others such as Gerben Zylstra (Rutgers University) are taking an alternative approach to identifying enzymes with unique sequences. Zylstra's interest is in pathways involved in degradation of aromatic compounds, but the same approach could be applied to other pathways. By selecting for lack of hybridization to a suite of probes for all representative enzymes known to be involved in a specific pathway, Zylstra has successfully identified unique enzymes from enrichments of aromatic hydrocarbon-degrading bacteria from polluted environments.
Hans Peter Klenk (TIGR) discussed a report in Science5 that the majority (62%) of predicted protein-coding genes in the Methanococcus jannaschii genome are of unknown function. The report has highlighted concerns that underlying assumptions used to assign probable function to ORFs need to be assessed carefully. The approaches described above will yield genes that share function but not necessarily sequence, underscoring a major point to emerge from the symposium: care must be taken when interpreting results from the flood of sequence information being placed in the public domain.
The inverse has also been shown6 for very different sequences encoding unrelated primary protein structures that result in similar functions. Gary Olsen’s (University of Illinois) presentation on the use of sequence information (primarily rRNA gene sequences) to infer phylogeny demonstrated the need for careful consideration of sequence similarity for purposes other than predicting gene-product function.
Many scientists predict that the availability of dozens of complete microbial sequences is a major breakthrough in fundamental biology that surely will result in new ways of approaching industrial microbiology and fighting infectious diseases. Without adequate tools for managing information or plans for prioritizing ever-shrinking financial resources for science, however, much may go unrealized, and many of the benefits may be deferred longer than necessary. [F. Blaine Metting (fb_metting@pnl.gov) and Margaret F. Romine (email@example.com), Pacific Northwest National Laboratory]
1. See http://www.pnl.gov:2080/health/attend.html for a list of symposium participants.
2. See http://www.tigr.org/tdb/mdb/mdb.html for listings of microorganisms whose sequences have been or are being determined.
4. http://www.cme.msu.edu/WIT (viewable only with frames-capable browser).
5. C.J. Bult et al., Science 273, 1058–73 (1996).
6. P.C. Babbitt et al., Science 267, 1159–61 (1995).
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v8n3).