A New Cooperative Strategy for Sequencing the Human and Other Genomes
J. Craig Venter, Hamilton O. Smith, and Leroy Hood
(Submitted to Nature)
Institute for Genomic Research; Johns Hopkins Medical School, Department of Molecular Biology and Genetics; University of Washington, Department of Molecular Biotechnology
One of the principal goals of the international Human Genome Project is to sequence, in a cooperative venture, the entire estimated 3 billion base pairs (bp) of DNA contained in the 24 different human chromosomes. The order of the nucleotides across each chromosome (the sequence maps) will permit the identification of the 100,000 or so human genes and provide the framework for studying how certain DNA variations among humans predispose to various diseases. This project, initiated in 1990 by the United States government (the National Institutes of Health and the Department of Energy), has been joined by the United Kingdom, France, Germany, and Japan. The cost was estimated to be $3 billion over 15 years. The first five-year period focused on genetic and physical mapping (1). The genetic map contains polymorphic DNA markers scattered evenly across the genome; the physical map is generated from overlapping DNA fragments covering the 24 human chromosomes. We are now moving into the more complex sequencing phase (2, 3). We propose here a new approach to sequencing the human genome that would greatly simplify the procedure and facilitate international cooperation between large genome centers and small groups. Moreover, it would greatly facilitate the sequencing of biologically interesting chromosomal regions, such as the gene families encoding neural and olfactory receptors, as well as smaller genomes from simpler organisms.
The most common approach to sequencing the human genome involves a three-stage divide-and-conquer strategy (Figure 1) employing the construction of three different clone libraries from human chromosomal DNA that is randomly cut, fractionated into differing size classes, and then inserted into distinct vectors capable of propagating the DNA fragments in appropriate hosts (e.g. bacteria or yeast) (Table 1). (A clone is a vector with one inserted fragment of human DNA. A clone library is the entire collection of the fragments of human DNA that are integrated into a particular type of vector in one experiment.) (i) Low resolution physical maps of each chromosome are prepared by identifying shared landmarks (e.g. unique PCR sites [sequence tagged sites, or STSs] or restriction enzyme sites) on overlapping yeast artificial chromosome (YAC) clones. (ii) High resolution or sequence-ready maps are then prepared by randomly cutting and subcloning YAC inserts into cosmid vectors. A map is constructed by identifying the landmark overlaps among the cosmid clones. (iii) A minimally overlapping path of cosmid clones is chosen and the DNA from each clone is randomly fragmented into small pieces and subcloned into M13 phage or plasmid vectors. For each cosmid, about 800 M13 clones are sequenced (average 400 base pairs) and assembled computationally into the sequence of the 40 kb cosmid insert. This random, or shotgun, approach ensures a high degree of accuracy because every nucleotide is, on average, sequenced 8 times (400 bases/clone x 800 clones = 320,000 bases of sequence, against a 40,000-base insert). Most genome-wide or chromosome-specific physical maps generated to date are of low resolution and based on YACs (6).
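The redundancy figure quoted for the shotgun step can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it computes nothing more than total read bases divided by insert length, using the numbers given in the text.

```python
def shotgun_redundancy(n_reads: int, read_length_bp: int, insert_length_bp: int) -> float:
    """Average number of times each base is read: total read bases / insert length."""
    return n_reads * read_length_bp / insert_length_bp


# Numbers from the text: 800 M13 reads averaging 400 bases over a 40 kb cosmid insert.
total_read_bases = 800 * 400
redundancy = shotgun_redundancy(n_reads=800, read_length_bp=400, insert_length_bp=40_000)

print(f"{total_read_bases:,} bases read, {redundancy:.0f}-fold average redundancy")
```

This is an average; individual bases in a random library will be read more or fewer times than the mean.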
This approach to genome-wide sequencing has several challenges and limitations. (i) The initial efforts for high resolution mapping of human chromosomes 16, 19, and 22 have been very expensive and they are still not finished; that is, the problem of obtaining complete maps without gaps is significant. Completing sequence-ready maps for these and the remaining human chromosomes remains a daunting and expensive task. (ii) Approximately 50% of YAC clones exhibit rearrangements, deletions, and chimerism (two or more DNA fragments inserted into one clone), thus often rendering them unsuitable as mapping and sequencing reagents. The effort necessary to identify the defective YACs is significant. To a smaller degree, cosmid inserts also delete, rearrange, and form chimeras, aberrations that are also often difficult to detect. (iii) The human genome contains tandem (adjacent) arrays of very similar homology units (e.g. five tandem 21 kb units) or tandemly arrayed genome-wide repeats like the 7 kb LINEs. These pose problems for high resolution mapping when the clone insert size is less than the tandem array length, because the landmarks are very similar (e.g. a 40 kb cosmid insert against a 105 kb DNA array). (iv) The conventional sequencing procedure is very complex and difficult to fully automate for the high throughput sequencing we hope to achieve in the future. (v) This approach makes cooperative collaborations among large and small groups difficult because a significant infrastructure is required for the high resolution physical mapping.
These problems, and two important scientific advances, have led us to propose a new strategy for cooperative sequencing of the human genome. The first advance rests in the ability to sequence and assemble megabase-size prokaryote genomes with high accuracy and fidelity (7). The second is the development of bacterial artificial chromosome (BAC) libraries with human insert sizes up to 350 kb. BACs appear to represent human DNA far more faithfully than their YAC or cosmid counterparts (8). For example, the 1 Mb human α/δ T cell receptor locus was mapped using only 17 BACs, in contrast to the 75 or so cosmid clones that would have been required to achieve the same coverage. Detailed landmark analyses demonstrated that only one of the 17 BACs had a defect (a small 6 kb deletion) (C. Boysen, personal communication). Moreover, BACs appear to be an excellent substrate for shotgun sequence analysis (e.g. 5/5 BACs, ranging in size from 89-210 kb, were successfully sequenced with this approach) (C. Boysen, personal communication), and other laboratories have also been successful in sequencing BACs. Accordingly, BAC clones appear to be excellent sequencing substrates that can be used to produce an accurate contiguous sequence.
Our new approach to genomic sequencing eliminates the need for any a priori physical mapping and uses BAC clones as the basic sequencing reagent (Figure 1). (i) A human BAC library with an average insert size of 150 kb and, on average, a 15-fold coverage of the human genome contains 300,000 clones. These will be arrayed into microtiter wells. (ii) Both ends (starting at the vector-insert points) of each BAC clone will be sequenced to generate 500 bases per end. The 600,000 BAC-end sequences will be scattered, on average, every 5 kb across the genome and will constitute 10% of the genome sequence. We will denote these end sequences as "sequence tagged connectors," or STCs, because they allow any one BAC clone to be connected on average to 30 others (e.g. a 150 kb insert divided by 5 kb will be represented in 30 BACs). The STCs would immediately be made available on the world wide web. (iii) Each BAC clone will be fingerprinted with one restriction enzyme to provide the insert size and to detect artifactual clones by comparing its fingerprint with those of overlapping clones. (iv) A seed BAC of interest will be sequenced by any method and checked against the data base of STCs to identify the ~30 overlapping BACs. The two BACs, one at each end, that exhibit internally consistent fingerprints and minimal overlap with the seed will be sequenced next. The entire human genome could thus be sequenced with slightly more than 20,000 BAC clones (Table 1).
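The library arithmetic and the seed-extension step in (i) through (iv) can be sketched in code. This is a simplified model under stated assumptions, not the procedure as it would actually be implemented: the names StcHit, overlap_bp, and pick_extensions are hypothetical, each STC match is reduced to a single position within the seed sequence, and fingerprint consistency is reduced to a yes/no flag.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

GENOME_BP = 3_000_000_000  # estimated genome size, from the text
INSERT_BP = 150_000        # average BAC insert size, from the text
COVERAGE = 15              # library coverage, from the text
END_READ_BP = 500          # bases read from each BAC end, from the text


def library_numbers() -> Dict[str, float]:
    """Reproduce the library figures quoted in the text."""
    n_clones = COVERAGE * GENOME_BP // INSERT_BP
    n_stcs = 2 * n_clones                      # two end sequences per clone
    spacing = GENOME_BP // n_stcs              # average distance between STCs
    return {
        "clones": n_clones,
        "stcs": n_stcs,
        "stc_spacing_bp": spacing,
        "stc_fraction": n_stcs * END_READ_BP / GENOME_BP,
        "bacs_per_seed": INSERT_BP // spacing,
        "minimum_tiling_clones": GENOME_BP // INSERT_BP,
    }


@dataclass(frozen=True)
class StcHit:
    """Hypothetical record: an STC from another clone that matches the seed."""
    clone: str
    seed_pos: int         # position in the seed where the other clone's end falls
    extends: str          # "left" or "right": direction the clone runs beyond the seed
    fingerprint_ok: bool  # simplified stand-in for the fingerprint consistency check


def overlap_bp(hit: StcHit, seed_len: int) -> int:
    """Bases shared between the seed and the clone that produced this hit."""
    return seed_len - hit.seed_pos if hit.extends == "right" else hit.seed_pos


def pick_extensions(hits: Iterable[StcHit], seed_len: int) -> Dict[str, Optional[StcHit]]:
    """For each end of the seed, keep the consistent clone with the least overlap."""
    best: Dict[str, Optional[StcHit]] = {"left": None, "right": None}
    for hit in hits:
        if not hit.fingerprint_ok:
            continue  # skip clones flagged as possibly artifactual
        current = best[hit.extends]
        if current is None or overlap_bp(hit, seed_len) < overlap_bp(current, seed_len):
            best[hit.extends] = hit
    return best


# Invented example data for a 150 kb seed BAC.
hits = [
    StcHit("bac_A", 140_000, "right", True),
    StcHit("bac_B", 148_000, "right", False),  # least overlap, but fails the fingerprint check
    StcHit("bac_C", 120_000, "right", True),
    StcHit("bac_D", 12_000, "left", True),
    StcHit("bac_E", 4_000, "left", True),
]
chosen = pick_extensions(hits, seed_len=INSERT_BP)
print(library_numbers())
print({side: hit.clone if hit else None for side, hit in chosen.items()})
```

In this toy data set the walk continues with bac_A on the right and bac_E on the left; bac_B is passed over despite its smaller overlap because its fingerprint is flagged. Repeating the same selection from each newly sequenced BAC is what would extend the contig in both directions.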
This approach has several unique advantages. (i) The cost and effort to obtain complete low and high resolution maps is virtually eliminated; thus, the front end automation is greatly simplified (e.g. clone arraying, DNA purification, fingerprinting, and sequence reactions). (ii) The BAC clones can be made readily available to sequencing groups throughout the world through resource centers and/or commercial distributors. Large centers could sequence multiple BAC clones forming major contigs while small groups could contribute one or a few BAC sequences. (iii) As improved techniques for generating BAC or other yet to be developed libraries appear, reasonable numbers of these new clones could easily be added to the clone collection. (iv) It appears likely this approach will obviate the significant problem of closure for high resolution physical mapping. (v) The existing chromosomal landmarks (STSs, or PCR-specific sites, and ESTs, or partial cDNA sequences) can be easily placed on the BAC clones, adding additional markers for BAC clones and taking significant advantage of any associated biological information. (vi) The 10% of the genome obtained in the STCs can be searched against the sequence data base to identify many interesting landmarks (e.g. genes, STSs, ESTs, etc.) that could locate the BAC clone on the preexisting chromosomal maps. (vii) Chromosomal regions of key biological interest can be sequenced first. (viii) The human genome can be sequenced earlier and for less cost (e.g. the savings on high resolution physical mapping). (ix) The STC approach will provide useful clones for biological studies even at the very early STC sequencing stages when only 3- to 4-fold coverage is achieved. (x) This would be an extremely efficient strategy for sequencing compact genomes (e.g. prokaryotes and single celled eukaryotes), as well as the model organisms the genome project has committed to sequence (e.g. E. coli, nematode, Drosophila, and mouse; the yeast genome is finished).
Arraying and DNA preparation facilities could readily make the end-sequenced BAC clones available to the world-wide genome community. BAC clones could be readily mailed and BAC-end sequences or STCs and fingerprints would be available on the world wide web, as would the identity of any clone selected for sequencing. Several research teams could thus work on the same chromosomal region without unintended duplication of effort. This would facilitate international cooperation. With our proposed strategy, participating laboratories could sequence the BAC inserts by any cost-effective method. Likewise, any DNA sequencing chemistries now in existence could be used, as well as any future chemistries.
The complete set of BAC-end sequences and fingerprints could be obtained in two years or less employing, for example, 30 Applied Biosystems 377 sequencers for a total cost of $5-10 million. This cost is a small fraction of the yet to be incurred cost of sequence-ready physical mapping.
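The throughput implied by this estimate can be worked out directly. The sketch below assumes the 600,000 BAC-end reads given earlier, uninterrupted operation every day of the two years, and no failed or repeated reads; the function names are hypothetical. Note that the quoted cost also covers fingerprinting, so the per-read figure is an upper bound on what the sequencing alone would cost.

```python
def reads_per_sequencer_per_day(total_reads: int, sequencers: int, years: float,
                                days_per_year: int = 365) -> float:
    """Successful reads each instrument must deliver daily to finish on schedule."""
    return total_reads / (sequencers * years * days_per_year)


def cost_per_read(total_cost_usd: float, total_reads: int) -> float:
    """Total project cost spread evenly over every end read."""
    return total_cost_usd / total_reads


TOTAL_READS = 600_000  # two end reads for each of 300,000 BAC clones

daily = reads_per_sequencer_per_day(TOTAL_READS, sequencers=30, years=2)
low = cost_per_read(5_000_000, TOTAL_READS)
high = cost_per_read(10_000_000, TOTAL_READS)

print(f"{daily:.1f} reads per sequencer per day")
print(f"${low:.2f} to ${high:.2f} per end read, fingerprinting included")
```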
A highly cooperative combination of large genome centers and small groups could finish the entire human genome sequence in under ten years. The current cost of DNA sequencing is around $0.30/finished bp in the most efficient laboratories and it is anticipated that it will fall to $0.10-0.25/base in the next one to three years. At these costs, the total sequencing costs for the entire genome would be less than the genome funds expended to date. The British foundation, The Wellcome Trust, recently announced funding the Sanger Center to sequence 1/6 or more of the human genome (9). Scientists in France, Germany, and Japan are discussing doing as much as 10% of the human genome each, while the United States' effort is just beginning with the establishment of six genome sequencing centers for pilot scale-up studies (10). These large centers around the world, in conjunction with many smaller groups, require an improved approach to genome coordination. The sequence tagged connector strategy proposed here offers a powerful new approach to sequencing the human and other genomes with a maximized level of international cooperation, and with all participants working on an equal basis in a self-regulating, open scientific effort.
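For scale, the per-base prices quoted above translate into the following totals for a single pass over the estimated 3 billion bp genome. This is a rough sketch with a hypothetical function name; it ignores overlap between neighboring BACs and any mapping or infrastructure costs, and prices are held in whole cents to keep the arithmetic exact.

```python
GENOME_BP = 3_000_000_000  # estimated genome size, from the text


def total_cost_usd(cents_per_bp: int, genome_bp: int = GENOME_BP) -> int:
    """Cost in whole dollars of finishing every base once at a given price in cents."""
    return genome_bp * cents_per_bp // 100


current = total_cost_usd(30)           # $0.30 per finished bp
projected_low = total_cost_usd(10)     # $0.10 per base
projected_high = total_cost_usd(25)    # $0.25 per base

print(f"current price:   ${current:,}")
print(f"projected range: ${projected_low:,} to ${projected_high:,}")
```

All three totals fall below the $3 billion originally estimated for the 15-year project.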
Last modified: Wednesday, October 22, 2003