Archive Site Provided for Historical Purposes
Sponsored by the U.S. Department of Energy Human Genome Program
A common question asked by incredulous audiences 15 years ago was, "Whose genome will you sequence?" After all, there are several billion human genomes, we were reminded, all of them different. I often answered somewhat cryptically that we would sequence everyone's and no one's. We were after a reference human genome: the organizational and structural properties of the genome that are invariant across our species. With this reference sequence now in hand, we are in a position to return to the more subtle and complex problem of diversity and to approach it with a power that scarcely could have been imagined 15 years ago.
Understanding diversity was in fact a central motivation of the Human Genome Project from the start. I recall in 1985 Mark Bitensky, then Director of Biological Sciences at Los Alamos National Laboratory, arguing passionately for genomic tools to characterize the molecular basis of disease predisposition and resistance and to develop an understanding that would make possible individualized medicine, what we now call pharmacogenomics.
Characterizing human polymorphisms would require rapid and accurate technologies for differential sequencing to detect that rare 1 in 1000 base substitution that on average distinguishes different genomes. The required technology was not available in 1985, but today mass spectrometry techniques can be used to perform 250,000 assays a day for single base substitutions at an error rate of less than one part in 10,000, thus providing the gold standard for single nucleotide polymorphism verification. We now are beginning to assemble the database required for the arduous task of associating complex disorders with sets of common alleles (that is, different versions of the same gene).
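The figures quoted above imply a useful back-of-envelope comparison. The sketch below simply multiplies out the rates given in the text (the throughput, substitution, and error figures are taken from the passage; treating each assay as an independent probe of a random base is an illustrative assumption, not a description of how the assays were actually targeted):

```python
# Back-of-envelope arithmetic on the figures quoted in the text.
# Assumption (for illustration only): each assay probes an independent,
# randomly chosen base.

ASSAYS_PER_DAY = 250_000   # mass-spectrometry assay throughput
SNP_RATE = 1 / 1_000       # average base-substitution rate between two genomes
ERROR_RATE = 1 / 10_000    # verification assay error rate

# Expected true substitutions encountered per day of assays:
expected_true = ASSAYS_PER_DAY * SNP_RATE      # 250
# Expected erroneous calls per day at the quoted error rate:
expected_errors = ASSAYS_PER_DAY * ERROR_RATE  # 25

print(f"expected true substitutions/day: {expected_true:.0f}")
print(f"expected erroneous calls/day:    {expected_errors:.0f}")
```

Under these assumptions, true variants would outnumber assay errors roughly ten to one, which is why an error rate of one part in ten thousand can serve as a verification gold standard for variants occurring at one part in a thousand.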
In a very real sense, much of biology is about change. Reference genomes, along with various machine-learning algorithms developed and adapted during the genome project, also are bringing microarray technologies to full power. In a few years, annotation of the human genome will be complete, and probes for every gene will be arrayed on a solid-phase substrate of a few square centimeters. What microarray technology does better than any other current method is to monitor and characterize change, whether it is the result of normal cell growth and development, progression toward disease, or response to exogenous ligands. Diversity thus will be characterized with a precision and breadth that may well revolutionize medicine during the coming decades.
But, perhaps, an even more fundamental change has begun. The high-throughput computational and experimental methods of the post-HGP era are forcing molecular biology away from its spectacularly successful reductionist roots back toward the integrative systems physiology required to understand cell behavior. High-throughput genomic methods, which are a revolution for global characterization, are a start in that direction. The long leap from characterizing to understanding, however, will be possible only after analogous technologies are developed for proteins.
Although massively parallel technologies for protein profiling are not yet available, ideas are abundant and fledgling methods are proliferating. In the next 5 to 10 years we can expect the emergence of high-throughput technologies for profiling proteins in various states of modification. Since many patterns of information processing, feedback control, memory, and the like will be widely conserved across species, bioinformatics tools, many yet to be developed, will amplify what is revealed. The result will be a very rapid increase in the rate of pathway and network discovery, classification, and correlation with cell behavior. We thus will begin to build a deep understanding of the molecular correlates of change, of the differences and similarities between cells, and of the differences between identical cells under varying chemical and physical conditions. These will include some largely overlooked conditions such as stress imposed by mechanical forces and the constraints of geometry.
At the heart of biological processes, whether thermodynamically or kinetically driven, is molecular structure. The ability to determine the structure of almost any protein domain computationally, and therefore rapidly and inexpensively, is well within reach.
There are two reasons for such optimism. First, the number of folds is relatively small. The best estimate, based on rigorous statistical sampling theory, is about 1300 for water-soluble domains. The same calculation indicates that, with an increase of 20% a year in protein-structure determination, 95% of all such folds can be determined in the next 15 years, even if we continue to select sequences at random. That time estimate is generous because, by using the right type of information, we can readily choose sequences in a way that increases the odds of finding a new fold. The second reason for optimism has to do with advances in algorithms for structure determination and, more important, the development of accurate and rapidly computable free-energy functions that include solvation.
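The shape of this estimate can be illustrated with a toy coverage model. The fold count and growth rate come from the passage; everything else here is an assumption made for illustration — in particular, the baseline rate of fold-informative structure determinations is a made-up parameter, and treating all 1300 folds as equally likely targets ignores the skewed fold-popularity distribution of real proteins (which is exactly why informed target selection beats random sampling):

```python
# Toy model of fold coverage under random target selection.
# TOTAL_FOLDS and GROWTH are taken from the text; BASE_RATE is an
# illustrative assumption, and uniform fold probability is a
# simplification, not the published statistical calculation.

TOTAL_FOLDS = 1300   # estimated water-soluble fold classes
GROWTH = 1.20        # 20% annual growth in structure determination
BASE_RATE = 55       # assumed fold-informative structures in year 1 (hypothetical)

def coverage(n_structures: int, folds: int = TOTAL_FOLDS) -> float:
    """Expected fraction of folds seen after n uniform random draws."""
    return 1.0 - (1.0 - 1.0 / folds) ** n_structures

cumulative = 0.0
for year in range(1, 16):
    cumulative += BASE_RATE * GROWTH ** (year - 1)
    print(f"year {year:2d}: coverage ~ {coverage(int(cumulative)):.1%}")
```

Even in this crude uniform-sampling picture, compounding 20% annual growth lets cumulative determinations outrun the slowly saturating coverage curve, reaching roughly 95% of folds by year 15 under the assumed baseline rate.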
A fold provides only a first approximation to geometry, but, with the fold in hand, the detailed structure can be determined computationally. The ability to obtain virtually any structure at will opens the way to cell-system design, pathway modification, and selection by directed mutations in proteins. Although humans took center stage in the genome project, every scientist recognized the universality of the methods. In particular, their potential applicability to DOE's environmental programs was of great interest from the outset.
Indeed, the new DOE plan, Genomes to Life, reminds us that microbes are ". . . the largest reservoir of . . . diversity on the planet." Specify an environmental condition, whether in a deep-sea vent where temperatures hover above 100°C or in a waste site where radiation levels greatly exceed lethal human doses, and there will be found an ecological niche with microbial communities that have learned to accommodate and flourish. Among the most important challenges of post-HGP biology will be characterizing, understanding, preserving, and exploiting life's robustness.
As experimental methods become increasingly powerful, the mathematical and computational methods of systems engineering will be essential for converting data to knowledge, and yet another discipline will be drawn into the biological sciences. The dexterity with which the computer science community responded to the genomic challenge and the increasing acceptance of computation in the bench scientist's tool kit constitute one of the most extraordinary developments in the recent sociology of science. These trends no doubt will continue and become more pervasive as the abstractions underlying life processes are revealed and data generation continues to accelerate. The recently added Centers of Excellence in Biomedical Computing at the National Institutes of Health and similar programs at the National Science Foundation and DOE are excellent initial steps in preparing for the future of a science whose culture is changing almost as rapidly as the science itself.
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v11n3-4).
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome. See Timeline for more HGP history.