Sponsored by the U.S. Department of Energy Human Genome Program
Human Genome News Archive Edition
Human Genome News, May-June 1995; 7(1)
The following article was adapted by HGN staff from two reports prepared for the Human Genome Organisation (HUGO) by freelance writer and editor Alison Stewart and published in Human Genome Digest 2(2), 1-4 (April 1995) and 2(3), 6-9 (July 1995). Substantial progress in developing a public-domain human transcript (gene) map was reported May 9-10 by researchers at Human Gene Map Workshop (HGMW) II in Cold Spring Harbor, N.Y., the second such workshop organized by HUGO. The Human Gene Map Initiative, an international effort to find and map expressed human genes and deposit the information in public databases, began last October in Washington, D.C., at a meeting sponsored by The Wellcome Trust; strategies were developed January 24-25 at HGMW I in London. This article summarizes highlights from the London and Cold Spring Harbor meetings.
Cooperation for mutual advantage was the take-home message of the January HGMW, which brought together academic and industrial scientists with representatives from pharmaceutical companies and institutions funding public research. At the meeting, an overall picture emerged of several loose associations or laboratory consortia that are sharing materials, coordinating activities to minimize overlap, and contributing their information to public databases. Various consortia reported using complementary strategies and exchanging information.
The strategy of choice for building up the gene map is based on expressed sequence tags (ESTs): short, identifying sequences obtained by partially sequencing cDNAs. ESTs are obtained from cDNAs represented in arrayed libraries from various tissues. If suitable primers for an EST are designed, PCR can be used to amplify the corresponding sequence from genomic DNA. The EST is thus converted to a sequence tagged site (STS) that can be mapped to a genomic location using radiation hybrids (RHs) or genomic clones such as yeast artificial chromosomes (YACs) and bacterial artificial chromosomes (BACs).
Charles Auffray (Centre National de la Recherche Scientifique and Généthon) and Greg Lennon [Lawrence Livermore National Laboratory (LLNL)], two founding members of the Integrated Molecular Analysis of Gene Expression (IMAGE) Consortium, reported on the use of arrayed cDNA libraries for gene sequencing, mapping, and expression studies [HGN 6(6), 3 (March 1995)]. All IMAGE collaborators deposit their data into public databases; as of June 27, 154,669 of the 206,654 human clone-derived EST sequences in dbEST were from IMAGE clones. Records for over 147,000 IMAGE clones have been submitted to the Genome Data Base (GDB) as well. The LLNL group, which as part of the IMAGE consortium supplies cDNA clones to both the Washington University (WU)-Merck & Co. and Généthon sequencing groups, will construct master arrays (perhaps 10,000 at a time) of clones representing unique genes, regardless of the library of origin. A running update of unique genes resulting from the Merck Initiative is available from http://image.llnl.gov/.
At the May meeting, Lennon set out some remaining tasks, including preparing master arrays of definitive cDNAs, sequencing full-insert cDNAs to complement genomic sequencing projects, and finding the remaining human genes.
The Merck-WU Initiative, involving groups at Merck, WU, and LLNL, aims to isolate one cDNA clone for each expressed human gene. This collection, which will be distributed by nonprofit and commercial concerns at reasonable cost, will be called the Merck Index. By March 1996, according to Rick Wilson (WU) and Keith Elliston (Merck), the initiative plans to obtain ESTs from 200,000 clones from a variety of primary and normalized cDNA libraries. Since February 10, new ESTs obtained by Merck collaborators have been added to dbEST at the National Center for Biotechnology Information (NCBI) at the rate of 4000 to 6000 each week. In the first phase of the project, highly and moderately expressed genes are likely to be overrepresented. In the second phase, such techniques as oligonucleotide fingerprinting and subtractive hybridization procedures may be used to identify clones representing more rarely expressed genes. Perhaps 70 to 80% of the 50,000 to 100,000 genes will be represented in dbEST by the scheduled completion date for the Merck initiative.
David Cox (Stanford University) outlined the organization of an international RH consortium that includes laboratories at the Stanford Human Genome Center and Whitehead-MIT Genome Center in the United States; University of Cambridge (UC), Sanger Centre, and Oxford University in the United Kingdom; and Généthon in France. The consortium employs two different panels of whole-genome RHs to generate high-resolution maps integrating human cDNAs with meiotic linkage markers. The Stanford RH panel consists of 83 hybrids and results in maps of 0.5-Mb resolution, and the Genebridge panel (Généthon-Cambridge) produces lower-resolution maps but allows markers to be linked more readily. Consortium laboratories are using RH DNAs distributed by Research Genetics Inc.
STSs generated from the 3' ends of cDNAs are being used in conjunction with RH DNAs to map most human genes over the next 2 to 3 years. A database at the Sanger Centre (RHALLOC) helps to coordinate efforts by listing markers being mapped in consortium laboratories. Raw mapping data and finished maps are being deposited in a variety of public databases, including European Bioinformatics Institute (EBI) and GDB.
In early May, dbEST contained 132,674 human EST sequences and accounted for almost half the number of sequence entries in GenBank®. Of these sequences, 83,602 had been submitted by the Merck sequencing group. According to Mark Adams [The Institute for Genomic Research (TIGR)], over 100,000 additional ESTs generated by TIGR will be made available through the TIGR database.
The question of redundancy in the set of ESTs is an important one: Just how many different genes are represented? Groups at Merck, TIGR, Généthon, and NCBI are tackling this question by grouping sequences into overlapping clusters that are likely to originate from the same gene. In a preliminary version of the Merck Gene Index, 38,149 ESTs reduced to 17,743 different genes. Auffray estimated that Généthon's collection of nearly 27,000 EST sequences represents about 9000 to 10,000 different genes. Adams described development of a TIGR database of tentative human consensus sequences based on 270,000 private and publicly available ESTs.
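The clustering idea described above can be illustrated with a minimal sketch. It assumes a deliberately naive overlap rule (two ESTs are grouped if the end of one exactly matches the start of the other over at least `min_overlap` bases). The function names, the threshold, and the toy sequences are hypothetical; the article does not describe the methods actually used by Merck, TIGR, Généthon, or NCBI, and real pipelines would need to tolerate sequencing errors and reverse-complement reads.

```python
def overlaps(a, b, min_overlap):
    """Return True if a suffix of one sequence equals a prefix of the other
    over at least `min_overlap` bases. Exact matching only, for illustration."""
    def suffix_prefix(x, y):
        longest = min(len(x), len(y))
        return any(x[-k:] == y[:k] for k in range(min_overlap, longest + 1))
    return suffix_prefix(a, b) or suffix_prefix(b, a)


def cluster_ests(ests, min_overlap=8):
    """Group ESTs into clusters of transitively overlapping sequences
    using a small union-find structure. `ests` maps an ID to a sequence."""
    parent = {name: name for name in ests}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    names = sorted(ests)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if overlaps(ests[first], ests[second], min_overlap):
                parent[find(first)] = find(second)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Toy data (hypothetical): est1 and est2 share a 10-base overlap, est3 stands alone.
toy_ests = {
    "est1": "ACGTACGTTAGCCGATAGGC",
    "est2": "GCCGATAGGCTTTAACCGGA",
    "est3": "TTTTTTTTGGGGGGGGCCCC",
}
toy_clusters = cluster_ests(toy_ests)
print(toy_clusters)
```

In this toy run, three ESTs reduce to two putative genes. For comparison, the preliminary Merck Gene Index figures quoted above (38,149 ESTs reducing to 17,743 genes) correspond to roughly 2.15 ESTs per gene.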
At NCBI, a UniGene-UniEST database is being built to supply high-quality, nonredundant sequences to mapping groups. The UniGene set, compiled from all gene sequences in GenBank that have a bona fide 3' untranslated region, contains 3125 sequences. UniEST, with 13,900 sequences, represents a set of unique ESTs based on 3' sequence reads from the Merck sequencing group. Mark Boguski (NCBI) estimated that UniEST and UniGene provide sequence data for at least 15,000 different human genes, perhaps 15 to 25% of all human genes, and urged that mapping groups use sequences from both sets. Boguski and Greg Schuler (NCBI) will coordinate the supply of gene-based mapping reagents to major mapping consortia and other interested groups.
As the number of ESTs grows, assessing the accrual rate of new sequences and developing strategies for finding unrepresented expressed sequences are important. The normalized cDNA libraries prepared by M. Bento Soares (Columbia University) continue to be the richest and cleanest source of novel ESTs; for example, the infant brain library was still yielding substantial numbers of new sequences after about 20,000 clones had been sampled. To track down rare transcripts that may be represented only in specialized tissues, new libraries will be prepared from more specific cell and tissue types, for example, adult retina, spinal cord, pineal gland, and multiple sclerosis lesions (Soares); and inner ear, hair roots, and glomerulus [Kousako Okubo (Osaka University)]. Some remaining problems in library construction include apparent differences in mRNA synthesis efficiency from different genes and the presence of clones without a poly A tail, resulting, for example, from internal priming events (now less than 10% in Soares' libraries).
Building a complete gene map requires information integration from different mapping approaches. Expressed sequences can be mapped at different levels including chromosome assignment (50- to 250-Mb resolution) and mapping to a DNA clone or contig or an RH (less than 1-Mb resolution). This year has seen a large increase in the number of ESTs and other markers mapped to the highest-resolution levels. Of more than 2000 IMAGE transcripts assigned to chromosomes by the time of the January meeting, perhaps half have now been mapped to higher resolution by the worldwide IMAGE community of over 100 researchers.
Mihael Polymeropoulos [National Center for Human Genome Research (NCHGR)] described progress in pilot experiments to map chromosome 8 cDNAs to YACs. Tom Hudson (Whitehead-MIT) reported screening 2300 gene-derived markers, of which 1500 were ESTs, against the CEPH mega-YAC library. This is part of a larger project in which 10,000 chromosomally assigned STSs have been screened against the YAC library, giving an average marker spacing of 300 kb. Currently, 70% of the contigs are anchored by genetic markers. Pilot experiments are under way to compare and integrate the map with RH maps. In general, good concordance exists among genetic, RH, and YAC maps.
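The 300-kb figure can be checked with simple arithmetic. The sketch below assumes a haploid human genome of about 3,000 Mb and evenly spread markers; neither assumption is stated in the article, and the function name is hypothetical.

```python
# Assumption: roughly 3,000 Mb haploid genome (not a figure given in the article).
GENOME_SIZE_KB = 3_000_000


def average_spacing_kb(n_markers, genome_size_kb=GENOME_SIZE_KB):
    """Average distance between markers if they were spread evenly."""
    return genome_size_kb / n_markers


spacing = average_spacing_kb(10_000)
print(f"{spacing:.0f} kb between markers")
```

Under these assumptions, 10,000 markers work out to one marker every 300 kb, in line with the spacing reported for the YAC screening project.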
Two sets of RHs are being used to construct framework maps with a resolution of around 500 kb that can be used to locate unknown ESTs. The Stanford hybrids are generated with a higher dose of X rays and have about twice as many breaks, thus allowing mapping at higher resolution, but the Cambridge-Généthon set allows significant linkage to be obtained with a higher fraction of cDNAs. These sets were described by Cox, Karin Schmitt (UC), and Jean Weissenbach (Généthon). Pilot EST mapping studies were explained by Cox and David Bentley (Sanger Centre) for the Stanford set and by Weissenbach for the UC-Généthon set. Cox also described the development of a new set of hybrids that can be used to construct maps at around 100-kb resolution; about 30,000 markers will be needed for this panel.
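The link between X-ray breaks and map position can be sketched in a few lines. Each marker is typed as present or absent in every hybrid of a panel; two markers that lie close together are rarely separated by a break, so their retention patterns differ in few hybrids. The example below is a simplified illustration only: the patterns are invented, the function name is hypothetical, and the consortium's actual analyses (which the article does not describe) would use statistical models that account for retention frequency rather than a raw discordance count.

```python
def discordance(pattern_a, pattern_b):
    """Fraction of hybrids in which exactly one of two markers is retained.

    Each pattern is a string of '1' (marker amplified from that hybrid)
    and '0' (not amplified), one character per hybrid in the panel.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must cover the same hybrids")
    differing = sum(a != b for a, b in zip(pattern_a, pattern_b))
    return differing / len(pattern_a)


# Hypothetical retention patterns across a toy panel of 10 hybrids
# (the 83-hybrid Stanford panel would give 83 characters per marker).
framework_marker = "1100101100"
est_near = "1100101000"  # differs from the framework marker in 1 hybrid
est_far = "0110011010"   # differs from the framework marker in 6 hybrids

near = discordance(framework_marker, est_near)
far = discordance(framework_marker, est_far)
print(f"near: {near:.1f}, far: {far:.1f}")
```

A low discordance suggests that the EST lies close to the framework marker, while a value near one half is what unlinked markers would tend to show. A panel with more breaks separates closely spaced markers more often, which is why it resolves finer distances but links fewer markers to the framework.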
Ken Buetow (Cooperative Human Linkage Center) described progress toward generating, maintaining, and distributing integrated high-resolution genetic maps highly enriched for user-friendly, PCR-based markers. ESTs are proving to be a source of useful genetic markers that can be linked explicitly to the genetic map.
As the gene map is built up, it will provide reagents and information that can be applied to genome sequencing. BACs, with an average insert size of 150 kb, could be an important mapping resource in establishing a sequence-level map. In a pilot study described by Hiroaki Shizuya [California Institute of Technology (Caltech)], 78 of 80 STSs from chromosome 22 were positive on the Caltech BAC library of 100,000 clones, while only 3 of 60 chromosome 22-specific cDNAs found no corresponding BAC clone. Research Genetics, in collaboration with Caltech, has prepared mouse and human genomic BAC libraries for distribution (see BAC Libraries Available).
Construction of suitable PCR primers allowing EST conversions to genomic STSs is, for many researchers, the most expensive component of a mapping project. Adams described progress in TIGR's undertaking to provide 10,000 STS primer pairs for ESTs in public databases. By May, 4014 primer pairs had been ordered from Applied Biosystems. Of 2700 received, about 900 were checked and over 700 distributed to members of the RH mapping consortium. Primers developed by TIGR will be sold via American Type Culture Collection (ATCC) on condition that information resulting from their use is placed in public databases as quickly as possible.
Mapping groups at the May meeting expected to map around 37,500 ESTs within the next year; this probably represents a realistic total of 10,000 to 20,000, a huge increase over the nearly 1000 mapped at the end of 1994. Direct cost estimates for mapping an EST ranged from around $170 to $240, taking into account some 30% that fail; costs could fall if this percentage can be lowered.
At the January meeting, Robert Waterston (WU) outlined his scenario for completing a sequence-level map of the entire human genome by 2001 with 99% coverage and 99.99% accuracy in coding regions. Waterston estimated that this ambitious project could be completed by three sequencing centers at a cost of around $0.10 per base. In the view of Waterston and John Sulston (Sanger Centre), mapping and sequencing are part of a continuum; transcript mapping should be seen as an intermediate stage in progress toward the ultimate sequence-level map, which will reveal the remaining genes. Then the real work of functional characterization will begin. In this context, supporting complementary work on other mammals such as the mouse is seen as vital in helping to identify true transcripts, aid mapping by taking advantage of syntenic relationships between genomes, and provide a system for functional studies.
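The cost figures quoted above lend themselves to a rough back-of-envelope check. The sketch below rests on several assumptions that are not in the article: that the $170 to $240 range is the cost per successfully mapped EST, that cost per success equals cost per attempt divided by the success rate, that the failure rate might fall to a hypothetical 20%, and that the genome contains about 3 billion bases. All names in the code are illustrative.

```python
QUOTED_COST_PER_MAPPED_EST = (170.0, 240.0)  # dollars, range quoted in the article
QUOTED_FAILURE_RATE = 0.30                   # "some 30% that fail"
HYPOTHETICAL_FAILURE_RATE = 0.20             # assumption, for illustration only
GENOME_BASES = 3_000_000_000                 # assumption: about 3 billion bases
COST_PER_BASE = 0.10                         # dollars, Waterston's estimate


def cost_per_mapped_est(cost_per_attempt, failure_rate):
    """Expected cost per success when a fraction of attempts fail."""
    return cost_per_attempt / (1.0 - failure_rate)


def implied_cost_per_attempt(cost_per_mapped, failure_rate):
    """Invert the relation above to recover the cost of a single attempt."""
    return cost_per_mapped * (1.0 - failure_rate)


improved_costs = []
for quoted in QUOTED_COST_PER_MAPPED_EST:
    attempt = implied_cost_per_attempt(quoted, QUOTED_FAILURE_RATE)
    improved = cost_per_mapped_est(attempt, HYPOTHETICAL_FAILURE_RATE)
    improved_costs.append(improved)
    print(f"${quoted:.0f} at 30% failure -> ${improved:.2f} at 20% failure")

sequencing_total = GENOME_BASES * COST_PER_BASE
print(f"Sequencing at $0.10 per base: ${sequencing_total / 1e6:.0f} million")
```

Under these assumptions, lowering the failure rate from 30% to 20% would bring the per-EST range down to roughly $149 to $210, and sequencing 3 billion bases at $0.10 each would come to about $300 million.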
To be usable by the scientific community, all information generated by public-domain gene-mapping projects must be accessible. Boguski, Graham Cameron (EBI), Keith Elliston (Merck), and Chris Fields [National Center for Genome Resources (NCGR)] outlined some aspects of the informatics network being set up to meet these requirements. All public-domain ESTs are deposited in dbEST, which, as the result of an international data agreement, is mirrored at EBI. Boguski (NCBI) described the development of Chromoscope, which allows sequence retrieval from dbEST by map location. Chromoscope acts as a "front end" to ENTREZ, which interfaces with other data sources on protein sequences and structures, for example. dbEST links to the Genome Sequence Data Base (GSDB) at NCGR. Release 2.2 of GSDB, due in August, will feature an alignment representation of sequence information. Its coordinate system will enable representation of multiple overlapping or allele sequences such as those for "disease" genes. Ken Fasman (Johns Hopkins University) described work under way at GDB, the major public repository for human map data. GDB release 6.0, due in early fall, will contain linkage, physical, and RH map data and information on reagents such as clones and ESTs. An enhanced WWW interface for GDB 6.0 will also be ready at the same time.
Many major laboratories involved in the Human Gene Map Initiative have developed their own WWW servers. Although these are probably the best way for inquirers to access the most up-to-date data, attendees argued forcefully that such sites are not an alternative to public databases. All laboratories have a responsibility to provide their data to at least one public database. Database curators must ensure that the information is distributed among different database sites and that the "average user" can gain access to information. Considerable confusion could be avoided if every reagent, such as a cDNA clone, carried a unique identifier (e.g., an IMAGE identification number and associated GDB accession number) used always by anyone submitting information to a database.
Identification systems used by the various consortia should ensure that all materials, such as clones and arrayed libraries, are traceable to their source. As part of the Merck initiative, the Computational Biology and Informatics Laboratory (University of Pennsylvania) is developing software for tracking clones, libraries, and all information accumulated about them. Many clones and libraries, such as those in Lehrach's reference library system, are available directly from laboratories where they are maintained [HGN 6(3), 7 (September 1994)]. Others, such as the Caltech BAC library, clones generated by the IMAGE consortium, and Cambridge-Généthon and Stanford RH panels, are being made available to individual users at reasonable prices via such sources as Research Genetics and ATCC.
New types of "downstream" public databases are also being developed for information such as gene expression patterns. Although the idea of a single, integrated database sounds appealing, this is probably not feasible. Attendees argued that a set of databases supporting different types of query represents a richer overall resource, as long as connections are transparent to the user.
At the January meeting, Francis Collins (NCHGR), David Owen (U.K. Medical Research Council), David Smith (DOE), and Michael Morgan (Wellcome Trust) pledged continuing support for projects integral to all gene-mapping consortia. As outlined by Manuel Hallen [European Commission (EC)], a welcome injection of new money is coming from the EC's new Biomedicine and Health Research Programme (Biomed 2). In 1995 and 1996, it will allocate 40 million ECUs to transnational genome research projects in the European region.
At the close of the London meeting, Peter Goodfellow (UC) appealed for increased cooperation between public and private sectors. He pointed out that academic institutions provide "seed corn" for private industry in the form of both people and ideas; companies willing to invest in the public sector can expect a rich return. Merck's continuing involvement is seen as essential to the rapid assembly of a public-domain gene map. Other companies represented at the meeting also signaled an interest in playing a role.
Two more meetings for fall 1995 and spring 1996 will be organized by HUGO to report progress in the Human Gene Map Initiative and usher in the era of the sequence-level map.
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v7n1).
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome.
Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.