Sponsored by the U.S. Department of Energy Human Genome Program
Human Genome News Archive Edition
Human Genome News, January 1998; 9:(1-2)
Most current tests for human exposure to environmental mutagens are only indicators of genetic damage and cannot predict adverse outcomes for individuals. In the following article, Anthony V. Carrano [Lawrence Livermore National Laboratory (LLNL)] explains that the future of genetic toxicology and mutation research lies in studying genes and individual genetic variation to reveal risk factors that make some people more susceptible to disease. The basic topic addressed by scientists who explore these issues, he notes, is the nature and consequence of genetic change or variation, with the ultimate purpose of predicting or preventing disease.
This article is excerpted from a talk by Carrano at the Human Genome Project session at the 1997 Society of Toxicology meeting in Cincinnati, Ohio. Other speakers were J. Craig Venter (The Institute for Genomic Research), Henry Wagner (Johns Hopkins University), and Richard Woychik (now at Case Western Reserve University).
The first 6 years of the Human Genome Project are behind us, and now resources can be applied to functional genomic studies, the genomics of the future. Functional genomics will be facilitated by completing the entire human genome sequence as soon as possible and, along the way, sequencing a significant portion of the mouse and other model-organism genomes. We want to determine all transcript structures and gene-expression patterns in the human genome; ultimately, we want to understand the phenotypes resulting from mutations in every one of the open reading frames. Many groups have begun working in this area even before the genome project is completed, and much work reported at recent genome meetings is pointed toward functional genomics.
As of March 1997, about 7000 genes had been identified, and around 5000 had been located and mapped to human chromosomes. Original estimates of total gene number were around 100,000, and although more recent estimates are as low as 50,000, I believe 75,000 to 80,000 genes will prove a more accurate figure. This translates to an average gene density of about 1 every 30 kb. Chromosome 19, which our laboratory has mapped and studied extensively since 1987 because it was the richest in G-C content, appears to have a higher gene density than many other chromosomes.
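The density figure above can be checked with simple arithmetic. The sketch below assumes a haploid genome size of roughly 3 billion base pairs, a commonly cited round number that is not stated in the talk, and it treats genes as if they were spread evenly, which the chromosome 19 observation shows they are not. The function name `kb_per_gene` is illustrative only.

```python
# Assumption (not from the talk): haploid human genome of ~3 billion bp.
GENOME_SIZE_KB = 3_000_000


def kb_per_gene(total_genes, genome_kb=GENOME_SIZE_KB):
    """Average spacing between genes, in kb, if genes were spread evenly."""
    return genome_kb / total_genes


# Compare the gene-count estimates mentioned in the text.
for estimate in (100_000, 80_000, 75_000, 50_000):
    print(f"{estimate:>7,} genes: about 1 every {kb_per_gene(estimate):.1f} kb")
```

Under that assumed genome size, 1 gene every 30 kb corresponds to the original 100,000-gene estimate, while 75,000 to 80,000 genes would work out to roughly 1 every 38 to 40 kb.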
Positional cloning, the standard approach to disease-gene finding, requires many resources. The Human Genome Project originally was set up to create such resources. In positional cloning, researchers start with families having an inherited disease, then develop genetic (polymorphic) markers to localize disease traits to specific chromosomes and, with finer delineation, to a region. After that, a set of clones (either YACs or BACs) is needed to obtain a contiguous DNA region that would contain candidate genes of interest. A higher-resolution set of clones such as cosmids would then allow scientists to identify cDNA transcripts and ultimately to use sequence information to identify the mutation associated with the phenotype. Many of these clones can be used now by the research community, and the cDNAs are available.
It is likely that these steps will be unnecessary in another 5 to 10 years because we will have the sequence for both normal and affected individuals. In some cases, we won't know the disease in individuals, but we can look at sequence information to predict the disease or function for the disease-associated gene. That's the goal we are shooting at—to bypass steps in mapping and sequencing.
Collections of cDNA molecules, which represent the coding (gene) sequences of the genome, offer researchers a way to bypass the millions of bases of noncoding DNA to obtain the sequences having the greatest biological significance. The I.M.A.G.E. consortium has the largest collection of publicly available cDNAs, which are used throughout the world. I.M.A.G.E. was started as a collaboration by four individuals: Greg Lennon (then at LLNL), Mihael Polymeropoulos (NIH), Bento Soares (then at Columbia University), and Charles Auffray (CNRS, France).
I.M.A.G.E. now has more than 56 human cDNA tissue libraries that are continually being expanded. They are collected at LLNL, where they are also arrayed and characterized, and information on them is sent to the Genome Database (GDB). More than 500,000 clones from the libraries in the I.M.A.G.E. collection have been arrayed. Clustering algorithms applied to the ESTs (expressed sequence tags) in GDB have identified about 60,000 unique clusters.
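The talk does not say which clustering algorithms were used, so the following is only a toy sketch of the general idea: treat two ESTs as belonging to the same cluster when they share an exact stretch of sequence, and count the resulting groups. The function `cluster_ests`, the k-mer length, and the example sequences are all hypothetical, and this simplified version ignores reverse complements and sequencing errors.

```python
from collections import defaultdict


def cluster_ests(ests, k=12):
    """Group sequences that share at least one exact k-mer (single linkage)."""
    parent = list(range(len(ests)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    first_seen = {}  # k-mer -> index of the first sequence containing it
    for idx, seq in enumerate(ests):
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    clusters = defaultdict(list)
    for idx in range(len(ests)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Made-up reads: 0 and 1 overlap, 2 and 3 overlap.
reads = ["ACGTACGGTT", "CGGTTAACC", "TTTTGGGGCC", "GGGGCCAAA"]
groups = cluster_ests(reads, k=5)
print(f"{len(groups)} unique clusters: {groups}")
```

Counting clusters rather than clones is what allows a collection of hundreds of thousands of arrayed clones to be reduced to an estimate of distinct transcripts.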
These clones are sent from distribution centers to scientists throughout the world. At the same time, they are sent to Washington University scientists who determine the sequences on the 5′ and 3′ ends and enter the information in the dbEST database at the National Center for Biotechnology Information.
Recently, the I.M.A.G.E. group has been adding mouse cDNA libraries, many of which are normalized. For the mouse, cDNA libraries have been created from staged times during development to help biologists understand when certain critical genes are expressed (I.M.A.G.E.: http://bbrp.llnl.gov/image/image.html).
Some interesting uses of these resources for studying diseases with a toxicological or genetic component can be illustrated by our group’s work a few years ago in collaboration with scientists studying the human CYP gene family. More than 60 genes code for the cytochrome P450 enzyme superfamily, which is involved in the metabolism of almost all chemicals to which we are exposed in our internal and external environments.
The CYP2A gene family codes for enzymes that are the first actors in a long pathway to detoxify and excrete xenobiotic chemicals (i.e., synthetic compounds foreign to living systems, such as drugs and insecticides). Available CYP2A family cDNAs were probed against the Livermore cosmid libraries developed as part of the National Laboratory Gene Library Project. Through a set of automated technologies, the cDNAs were built up very quickly into a contig spanning 350 kb. In that stretch are 11 genes from the CYP2A family, averaging 1 every 30 kb.
Looking specifically at the CYP2A6 gene in 182 individuals of various ethnic backgrounds, investigators found two sequence variants. CYP2A6 is responsible for coumarin metabolism and is called coumarin hydroxylase. Coumarin, a drug used in the formulation of blood anticoagulants, also has been suggested and evaluated for the treatment of lymphoedema. Variant 1 has a single amino-acid change. In heterozygotes for this variant, activity of the enzyme is reduced to about 50% to 80%; homozygotes range from no activity up to 50% activity. Highly variable activity is associated with these heterozygotes; in fact, the pharmacological basis of this is quite well understood and determines drug treatment and outcomes. A second variant has several differences in its sequence. There are no known homozygotes for this variant and, as far as we now know, it does not produce a functional protein.
The goal of these studies is to identify links between disorders and individual variabilities in DNA sequences ("polymorphisms"). Such links will help researchers and clinicians identify people who may be predisposed to developing disorders such as cancer. (The converse also is probably true: some DNA variations may offer protection from these diseases.) With this knowledge, individuals can make informed decisions about aspects of their lifestyles (e.g., diet and occupation) that may help prevent or delay the onset of some diseases.
DNA Repair Genes
DNA repair genes, discovered by Richard Setlow and other DOE-funded researchers in the mid-1960s, also are important to toxicology and mutation research. LLNL has focused its research on chromosome 19 not only because of its high G-C content but also because of the DNA repair enzymes encoded by genes on this chromosome. LLNL has a history of more than 20 years of DNA repair studies, beginning with Larry Thompson and others, that ties into the DOE goal of understanding mutations caused by exposure to ionizing radiation and other environmental pollutants associated with energy production and use.
Several genes involved in different DNA repair pathways, such as base or nucleotide excision repair, have been cloned and mapped, and their cDNAs have been isolated. In many cases, the genomic sequences have been determined at LLNL. Functions that have been ascribed to several of the DNA repair genes include helicase, endonuclease, polymerase, ligase, and others not fully understood but somehow involved in recombination and repair processes.
These known human genes have been found to be very similar to genes in yeast and even, in some cases, bacterial genes. Since we understand the DNA repair and metabolism systems much better in yeast and bacteria than in humans, we can begin with data from these model systems to understand human systems.
To take the DNA repair work a step further, one of the DNA repair genes, ERCC2, is interesting because it has variable expression in individuals and is associated with three different diseases. The most severe of these diseases is xeroderma pigmentosum type D, which is characterized by high sensitivity to UV light, high cancer incidence, and some neurological disorders. Another of the diseases, Cockayne's syndrome, is characterized by slight photosensitivity and severe neurological defects but does not have a high cancer incidence.
The third ERCC2 disease, trichothiodystrophy, presents a defect in metabolism that causes brittleness of hair, pale skin, slight photosensitivity in about 50% of patients, and some minor neurological defects. Severity can vary. Chris Webber at Livermore and collaborators have found mutations associated with these diseases throughout the entire gene, but none can be associated absolutely with the phenotype. The ERCC2 protein is one member of a multiprotein complex and is part of a transcription factor complex necessary for gene transcription into mRNA. Protein folding and interaction in the complex may affect the form and phenotype produced.
Does DNA variation mean anything? Can we use DNA variation, as we've seen in the previous examples, as an indicator of disease or susceptibility? LLNL scientists Harvey Mohrenweiser and Richard Shen have measured variability in the coding sequences of some DNA repair genes in people. Some variants in the DNA sequence occur at high frequency and lead to nonconservative amino-acid changes in the protein. The next step is to link this information to populations that have increased susceptibility to disease.
Migraine is another example of a disease with strong genetic and environmental interactions, although the environmental component is not well understood. About 24% of females and 12% of males are affected by migraine, and some healthcare researchers believe it is the most common reason for seeking outpatient care in the United States. A subset of migraines, called hemiplegic migraine, is often preceded by an aura. Hemiplegic migraine has also been associated with another disease called episodic ataxia. Both of these are now known to be associated with the same calcium channel genes.
In a collaboration with researchers in the Netherlands, Mohrenweiser and others at Livermore were able to link the disorder with changes in a region of chromosome 19. In this case, resources available from the genome project included a clone collection, and it was suspected that a candidate calcium channel gene in a 1- to 3-Mb region of chromosome 19 was associated with familial hemiplegic migraine. With techniques to trap the cDNAs in that large region, the team found the alpha-1 subunit of the candidate gene. The alpha-1 subunit contains four repeated units, each composed of six alpha-helical segments spanning the membrane, along with a pore section that forms and controls a channel in the membrane.
Family members were identified, the gene for the alpha-1 subunit was sequenced, and mutations were found. For familial hemiplegic migraine, several mutations are associated with the alpha-helical units as well as with the pore unit. Mutations associated with episodic ataxia in the same region correlate closely with the condition in these families. We know that some calcium channel genes are involved in 5-hydroxytryptamine release, and their pharmacological role in migraine would be very interesting to pursue.
More important, certain environmental factors might help trigger migraine onset. Environmental factors that influence the gene constitute an area ripe for further studies.
As mentioned, many novel human genes are being identified through comparisons with sequences from other species. In mouse-human comparisons, researchers use gene probes that will identify both human and mouse clones. For the DNA repair gene XRCC1, there are 17 exons in human and the same number of highly conserved exons in the mouse. Interestingly, some noncoding regions are also conserved. What function do these highly conserved, noncoding regions perform? Are they actually putative regulatory regions?
To understand the functions of gene sequences, researchers can use a suite of techniques including the standard biochemical approaches, structural approaches, or animal models such as knockout technologies in mice. In knockouts, a particular gene is disabled in the mouse, and any resulting changes in the animal are noted.
The work discussed here could not have been done without automation in some technologies. Robotic systems are available that can put tens of thousands of DNA spots or DNA clones on a high-density filter, and hundreds of filters can be created automatically in a day. Using these filters is the fastest way to find a gene. The filters can be mailed anywhere in the world, and receiving laboratories can probe the filters with their own DNA sequences. Coordinates identifying the probe-positive clones containing a gene of interest can be sent back to the originating laboratory, and the genes can be pulled from the collections of clones in the refrigerator.
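The round trip described above, from a positive spot on a filter back to a clone in the freezer, depends on a fixed mapping between filter coordinates and plate wells. The sketch below shows one way such a mapping could work. The layout is an assumption made purely for illustration (384-well source plates stamped onto the filter in an interleaved pattern); the talk does not describe the actual gridding scheme used at LLNL, and `spot_to_clone` is a hypothetical name.

```python
import string

# Assumption: clones are stored in standard 384-well plates (16 rows x 24 columns).
PLATE_ROWS, PLATE_COLS = 16, 24


def spot_to_clone(row, col, interleave=2):
    """Map a zero-based filter spot (row, col) to (plate number, well name).

    Assumed layout: each well position on the filter holds an
    interleave x interleave block of spots, one spot per source plate.
    """
    max_row = PLATE_ROWS * interleave
    max_col = PLATE_COLS * interleave
    if not (0 <= row < max_row and 0 <= col < max_col):
        raise ValueError(f"spot ({row}, {col}) is off the filter")
    well_row, sub_row = divmod(row, interleave)
    well_col, sub_col = divmod(col, interleave)
    plate = sub_row * interleave + sub_col + 1  # plates numbered from 1
    well = f"{string.ascii_uppercase[well_row]}{well_col + 1}"
    return plate, well


# Coordinates as a receiving laboratory might report them.
for spot in [(0, 0), (2, 5), (31, 47)]:
    plate, well = spot_to_clone(*spot)
    print(f"filter spot {spot} -> plate {plate}, well {well}")
```

With `interleave=2` this toy filter carries only 1536 spots; the filters described in the text hold tens of thousands, so a real scheme would use a denser pattern, but the lookup principle would be the same.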
Sequencing will change dramatically. With the advent of new technologies, especially those based on microchannel or capillary sequencing, we will increase the rate at least 10- to 30-fold over the next 3 to 4 years. With totally automated refilling and minimal human intervention, it should be possible to produce 500 bases of sequence per lane every 2 hours.
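To put the per-lane figure in perspective, the short calculation below converts it to a daily output. The 500 bases and 2 hours come from the text; the assumptions of continuous 24-hour operation and the example lane count of 96 do not, and the result is raw sequence with no allowance for failed runs or redundant coverage.

```python
def bases_per_day(bases_per_run=500, hours_per_run=2, lanes=1):
    """Raw bases produced in 24 hours of uninterrupted operation."""
    runs_per_day = 24 / hours_per_run
    return bases_per_run * runs_per_day * lanes


print(f"one lane: {bases_per_day():,.0f} bases per day")
# Assumption: a hypothetical 96-lane instrument, chosen only as an example.
print(f"96 lanes: {bases_per_day(lanes=96):,.0f} bases per day")
```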
As a final example, everybody wants to do the polymerase chain reaction (PCR), and forensic toxicologists want to do PCR in the field. A portable PCR machine was developed at LLNL with Department of Defense funding as a suitcase device that can do PCR in 20 minutes for a single gene or complex set of genes. This machine, based on a new silicon chamber device developed at LLNL, is now being commercialized.
So these are some technologies that are here now or coming on the scene in the next few months. It would be wonderful to see such resources and information in the hands of many more investigators. They have not yet taken full advantage of these opportunities. The time is past due for the mutation research community to use resources and capabilities produced by genome research.
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v9n1).
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome.
Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.