Sponsored by the U.S. Department of Energy Human Genome Program
Human Genome News Archive Edition
Human Genome News, April-June 1996; 7(6)
Santa Fe '96
LANL: Scanning the Genome with SASE Sequencing
Researchers at the LANL human genome center (http://www.lanl.gov/orgs/b/about_us.shtml) are testing a large-scale sequencing approach designed to identify genes quickly while capitalizing on LANL's high-resolution maps of chromosome 16. The dense clone coverage of the chromosome, now at about 98% and mostly in cosmids, provides an ideal framework for sequencing [Doggett et al., Nature 377, 335-65 (1995)].
LANL's strategy, outlined by Darrell Ricke, is to skim through chromosome 16 using a random (or "shotgun") method, in which they break 40-kb cosmid clones into 3-kb pieces, subclone them, and sequence both ends of each subclone. The sequences are analyzed using a new sequence-comparison system, and interesting regions (such as coding areas) are identified for more detailed, "finished" sequencing.
The focus on regions of immediate interest makes this a low-investment (one-tenth the price of finished genomic sequencing), potentially high-payoff strategy. Another advantage of the approach is that identifying the genes and exons via sequence analysis provides more information than simply mapping ESTs.
The front end of this approach, which LANL calls SASE (for sampled sequencing), will allow LANL to rapidly generate aligned sequences along the chromosome 16 map. Sequencing both ends of a 1x sampling of subcloned cosmid fragments, along with cosmid end sequences, yields 70% sequence coverage with 98% clone coverage. The majority of this clone coverage is ordered by the relationships among the subclone end sequences, which are ideal substrates for directed sequencing strategies. At LANL, finished sequencing is done rapidly by parallel primer walking along the original cosmid DNA.
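The relationship between sampling depth and coverage described above can be illustrated with a rough back-of-the-envelope model. The sketch below assumes idealized uniform-random placement of subclones (a simple Poisson model), ignores edge effects and cosmid end sequences, and assumes a 400-bp read length; neither the model nor the read length comes from the article, so the output is only indicative and is not expected to reproduce the reported figures exactly.

```python
import math


def expected_coverage(redundancy: float) -> float:
    """Expected fraction of a target covered at a given average redundancy,
    assuming uniform-random placement (idealized Poisson model)."""
    return 1.0 - math.exp(-redundancy)


def sase_estimate(cosmid_bp=40_000, subclone_bp=3_000,
                  read_bp=400, read_redundancy=1.0):
    """Rough coverage estimate for end-sequencing sampled subclones.

    read_bp is an assumed read length; it is not stated in the article.
    """
    # Subclones needed so that two end reads per subclone give the
    # requested read redundancy across the cosmid.
    n_subclones = read_redundancy * cosmid_bp / (2 * read_bp)
    # Redundancy of the subclone inserts themselves over the cosmid.
    clone_redundancy = n_subclones * subclone_bp / cosmid_bp
    return {
        "subclones": n_subclones,
        "sequence_coverage": expected_coverage(read_redundancy),
        "clone_coverage": expected_coverage(clone_redundancy),
    }


if __name__ == "__main__":
    for key, value in sase_estimate().items():
        print(f"{key}: {value:.3f}")
```

Under these assumptions, a 1x sampling of reads corresponds to roughly 50 subclones per 40-kb cosmid, whose 3-kb inserts together span the cosmid several times over. This is why clone coverage can be far higher than sequence coverage.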
All SASE data are deposited in Genome Sequence Data Base and remain readily available for analysis. A notation is placed on particular sequences already targeted for finishing.
The data are sufficient to allow PCR amplification of the sequenced region, eliminating the need for extensive clone archiving and distribution and enabling many laboratories to participate in completing the sequencing of chromosome 16. LANL will SASE sequence cosmids to determine a minimum tiling set of 3-kb subclones, and LBNL will then finish sequencing the targeted LANL cosmids. This collaboration leverages the strengths of both laboratories and increases productivity. LANL is also collaborating with The Institute for Genomic Research in supplying map information for sequencing a portion of 16p.
Ricke acknowledged that some regions of the chromosome would be missed, but an early concentration on potentially interesting areas makes this strategy attractive (see "Telomeres"). Genomic areas with lower information content could be finished later, he noted, when sequencing technologies are more cost-effective.
Targets for Finished Sequencing
To identify the genes in SASE data, LANL developed the SCAN (Sequence Comparison ANalysis) program. SCAN integrates the results from BLAST and FASTA searches and will soon add GenQuest and GRAIL servers and Smith-Waterman. When SCAN finds a repeat, vector, Escherichia coli homology, or rodent homology, it gives a 1- to 2-line summary. All results are integrated and a summary report generated, with HTML pages that are hotlinked to other databases.
The sequence-analysis and automatic annotation features make SCAN an efficient program for large-scale projects. Results are displayed in a multiple-sequence alignment that gives a quick overview. Cosmids can be screened within minutes to reveal homologies.
The initial 1-Mb region of 16p completed by SASE has proven to be very gene rich, and investigators plan to finish sequencing the region and continue to adjacent regions of the 4-Mb high-resolution physical map. Projected sequencing throughput would allow complete SASE analysis of the 90-Mb euchromatic arms of this chromosome in just a few years.
SASE Sequence Analysis
David Torney explained a new technique for identifying coding sequences. This technique involves converting DNA sequences into binary sequences of 0s and 1s and then determining the "parities" of subsequences. The parity takes one of two values, depending on whether the number of 1s in the subsequence is even or odd.
"This technique comprehensively captures the features of coding sequences," Torney said. For example, a group of sequences of length n can be characterized completely by the average parities for all 2^n subsequences.
Thus, investigators can make accurate subsequence classifications based on differences between coding and noncoding sequences. Subsequences with the fewest letters were found to be the most discriminating. Using this approach, LANL scientists correctly classified both coding and noncoding 54-base sequences 72.5% of the time. If, in addition, the sequence frame and strand were known, the correct classification rate was 82%.
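The parity idea can be made concrete with a small sketch. The article does not say how bases are converted to bits, so the 2-bit encoding below (A=00, C=01, G=10, T=11) is purely an assumption, as are all function names. "Subsequence" is taken here to mean a chosen subset of bit positions. Because the number of such subsets grows exponentially with sequence length, the sketch computes average parities only for small subsets.

```python
from itertools import combinations

# Hypothetical 2-bit encoding; the article does not specify the mapping.
BASE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def to_bits(dna):
    """Convert a DNA string into a flat list of 0s and 1s."""
    bits = []
    for base in dna.upper():
        bits.extend(BASE_BITS[base])
    return bits


def parity(bits, positions):
    """Return 0 if the number of 1s at the given positions is even, else 1."""
    return sum(bits[p] for p in positions) % 2


def average_parities(sequences, max_size=2):
    """Average parity over a group of equal-length sequences for every
    subset of bit positions containing at most max_size positions."""
    encoded = [to_bits(s) for s in sequences]
    n_bits = len(encoded[0])
    averages = {}
    for size in range(1, max_size + 1):
        for positions in combinations(range(n_bits), size):
            total = sum(parity(bits, positions) for bits in encoded)
            averages[positions] = total / len(encoded)
    return averages


if __name__ == "__main__":
    group = ["ACG", "ATG", "CCG"]  # toy sequences, not real data
    for positions, avg in sorted(average_parities(group).items()):
        print(positions, round(avg, 3))
```

A classifier along these lines would compare the average parities of a set of known coding sequences with those of a noncoding set and score a new sequence by which profile it resembles more closely. That scoring step is not shown here.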
LLNL: Linking Production Sequencing to the Underlying Biology
At the LLNL human genome center, large-scale chromosome 19 sequencing is coupled with understanding the human genes involved in DNA repair (see "Spell Checking the Genome").
These interests are rooted in DOE's mission to develop better technologies for measuring health effects, particularly mutations. Alterations in DNA repair genes can predispose individuals to cancer. LLNL researchers have cloned six different genes involved in repair processes. The focus has been on three of the genes that feed directly into genomic sequencing and the downstream biology aimed at elucidating repair processes. Low-pass sequencing approaches are being developed to minimize redundancy, increase throughput, find genes, and identify candidate regions for higher-redundancy sequencing.
Chromosome 19 sequence data are analyzed and used to generate targeting constructs in making transgenic mouse models and for characterizing the structure and function of these repair genes. LLNL is complementing genomic sequencing with sequencing of full-length cDNAs mapped to chromosome 19 cosmids. The genome center is also performing comparative sequence analyses of the mouse and human genomes, especially in DNA repair gene regions, for elucidating coding structure and identifying putative regulatory regions. These latter projects are being done in collaboration with researchers at Oak Ridge National Laboratory (see "Leaping Across Genomes").
With the chromosome 19 map completed, the LLNL genome center was reorganized recently to scale up the sequencing facility and provide high-throughput, high-accuracy sequence for the entire chromosome. Jane Lamerdin discussed some components of the random shotgun strategy, which includes use of a modified LBNL colony picker, three Beckman Biomek 1000s, and production of 600 templates a day. Center scientists are now running 10 to 15 gels a day using an ABI autoassembly package and Phred and Phrap software. Lamerdin estimates that, with an 8-fold redundancy and 70% success rate, current capabilities are about 4 Mb of finished sequence a year. Data analysis has been a bottleneck.
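The arithmetic behind a throughput estimate of this kind can be sketched as follows. The template rate, success rate, and redundancy are taken from the figures above. The read length (400 bp), the number of working days (200), and the assumption of one read per template are illustrative guesses that do not appear in the article; other combinations of values would give similar totals.

```python
def finished_bases_per_year(templates_per_day, success_rate, redundancy,
                            read_bp, working_days):
    """Estimate finished sequence per year from raw read production.

    Assumes one read per template. read_bp and working_days are
    illustrative assumptions, not figures from the article.
    """
    raw_bases = templates_per_day * success_rate * read_bp * working_days
    # Each finished base is covered 'redundancy' times on average.
    return raw_bases / redundancy


if __name__ == "__main__":
    estimate = finished_bases_per_year(
        templates_per_day=600,  # from the article
        success_rate=0.70,      # from the article
        redundancy=8,           # from the article
        read_bp=400,            # assumed
        working_days=200,       # assumed
    )
    print(f"{estimate / 1e6:.1f} Mb per year")
```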
Joe Balch is leading an LLNL team to develop a next-generation DNA sequencer based on arrays of microchannels etched and sealed in a glass substrate as an alternative to arrays of discrete glass capillaries. The plan is to develop a 96-channel array system first and a 384-channel array system later to sequence DNA samples in less than 2 hours.
DNA Repair Gene Analysis
The LLNL genome center has generated some 1.2 Mb of genomic sequence, using a random shotgun strategy with an 8-fold average redundancy and getting complete double-stranded coverage where possible. The primary effort has been targeted to cosmids containing the human DNA repair genes HHR23A, XRCC1, and ERCC2 on chromosome 19, ERCC4 on chromosome 16, XRCC3 on chromosome 14, and XRCC2 on chromosome 7, as well as selected rodent homologs. Genomic sequencing is also being used as a gene-discovery method in a chromosome 19 targeted region (19p13.1) associated with olfactory receptors and a congenital kidney disease.
The LLNL group sequenced a total of 76 kb containing human and mouse XRCC1 genes, identifying coding regions and nine conserved elements. They also completed 54 kb of human sequence encompassing the ERCC2 gene and 54 kb spanning the syntenic regions in mouse and hamster. A defect in this gene leads to the disorder xeroderma pigmentosum, in which some people have extreme UV sensitivity and are very prone to cancer. Other phenotypic effects include neurological defects and a defect in sulfur metabolism characterized by brittle hair. Sequence analysis of the ERCC2 gene by Christine Weber detected no single location for mutations leading to a particular defect; all seem to map to the last third of the gene. Structural analysis of the protein sequence may provide interesting clues, including more precise association of mutations with phenotypes.
Researchers found that the human ERCC2 gene, comprising 23 exons (coding areas), is 98% identical to the rodent homolog at the protein level. They identified two genes flanking ERCC2; all three genes and their orientation are conserved in humans, mice, and hamsters. Gene products of ERCC2 and ERCC4 are involved in the nucleotide excision repair pathway that recognizes and removes DNA damage.
Lamerdin noted that these results underscore the power of using a comparative approach to finding genes because gene-finding software did not identify all the coding areas in the ERCC2 target region.
Sequencing has been completed on a cosmid and its associated cDNA for the recently cloned human XRCC3 gene, which appears to play a crucial role in chromosomal stability. The predicted protein shares residue identity with the guanosine 5'-triphosphate binding domain of the Saccharomyces cerevisiae rad51 and rad57 proteins involved in recombinational repair. Sequence analyses of several candidate cDNAs for the XRCC2 gene also show similarity to the same domain in these proteins. Sequence analysis of the XRCC3-containing cosmid identified a kinesin light chain gene physically linked to a DNA repair gene.
LBNL: Scaling Up a Directed Approach
For the past year, the human genome center at Berkeley Lab has focused on building new mechanisms and technologies for large-scale human genome sequencing. The center consists of four groups: genome sequencing and technology, human physical mapping and biology, automation, and informatics.
The initial sequence target in the human genome consists of a 10-Mb growth-factor-rich region located at 5q31-q35. A collaborative project with the Berkeley-based Drosophila Genome Center uses technologies developed at LBNL to physically map and sequence the organism's genome. A new project with the Resource for Molecular Cytogenetics at LBNL and University of California, San Francisco (UCSF), will sequence a 600-kb region on human chromosome 20.
In 1995 Berkeley Lab completed almost 1.6 Mb of sequence, with over 700 kb primarily from human chromosome 5. Total sequence generated during the last 4 years is 3.7 Mb, with over 1 Mb of human sequence. All sequence is double stranded, and the error rate is less than 1 in 2500 bases. Researchers found the physical map based on P1 clones (average insert size, 80 kb) to be an excellent substrate for genomic sequencing. All human sequence was obtained by sequencing 3-kb subclones derived from the P1 physical-mapping clone set that spans the target region on chromosome 5. Sources of P1 include a chromosome 5 map generated by Eddie Rubin and Jan-Feng Cheng at LBNL and a chromosome 20 map by Joe Gray's group at LBNL and UCSF.
Chris Martin, head of the production sequencing group, described scale-up strategies for the directed sequencing approach, in which every sequencing template is first mapped to a resolution of 30 bp. The advantages of this approach include a large reduction in the number of sequencing reactions needed and in the sequence-assembly steps that follow. A key challenge, Martin noted, was the development of management and training structures that could scale up to the level needed.
Directed Sequencing Strategy
Four modular, highly adaptable components have been developed for the directed approach, with data quality monitored at each step. Most decision making has been automated. The approach involves shearing and subcloning the physical-mapping clone, end sequencing 192 of these subclones, and generating a minimal tiling path of subclones using custom software. Subsequent steps consist of generating and mapping transposon inserts in each subclone in the minimum tiling path and sequencing using commercial primer-binding sites engineered into the transposon. The sequence is then assembled using the high-resolution physical-mapping information produced by the preceding steps. A faster variation of the process was proposed in which a mostly single-stranded, or scaffold, sequence based on more widely spaced (600- to 1000-bp) transposons will be constructed. The data from the end-sequencing step will then be layered into this mapped, verified, and transposon-based scaffold sequence and used to develop completed sequence at even lower cost.
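The tiling-path step lends itself to a short illustration. The article does not describe how LBNL's custom software works, so the sketch below substitutes a standard greedy interval-cover method. It assumes, as a simplification, that each subclone's start and end coordinates along the mapping clone are already known. The function name and data layout are hypothetical.

```python
def minimal_tiling_path(subclones, target_start, target_end):
    """Pick a small set of overlapping subclones spanning the target.

    subclones: list of (name, start, end) tuples with known coordinates
    (an assumption made for this sketch). Uses a greedy interval cover:
    at each step, take the subclone that starts within the region covered
    so far and extends furthest to the right.
    """
    ordered = sorted(subclones, key=lambda clone: clone[1])
    path = []
    covered_to = target_start
    i = 0
    while covered_to < target_end:
        best = None
        # Examine every not-yet-seen subclone starting inside covered region.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    # Toy coordinates in base pairs, not real data.
    clones = [
        ("a", 0, 3000), ("b", 1000, 4000), ("c", 2500, 5500),
        ("d", 3900, 7000), ("e", 5000, 8000),
    ]
    print(minimal_tiling_path(clones, 0, 8000))
```

For a single linear target with known coordinates, this greedy choice uses the fewest subclones possible. A production system would also have to handle uncertain positions and gaps, which the sketch simply reports as an error.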
Sam Pitluck of the informatics group at Berkeley Laboratory described software tools developed in collaboration with Gene Myers and Susan Larson (University of Arizona). The Fragment Assembly Kernel (FAK, written by Myers) was chosen for its ability to handle up-front mapping information as constraints. The group built an interface to FAK using SPACE (Sequencing Platform using ACE), a variant of the ACeDB suite of database, analysis, and display software (Durbin and Mieg, 1991). ACeDB, developed originally for the C. elegans genome research community, has been used mostly as a database program. In SPACE the capabilities of trace editing, assembly, and fragment and contig display were added. SPACE is now being used in the production environment.
Kelly Frazer, a recipient of a DOE Human Genome Distinguished Postdoctoral fellowship, described efforts to probe the primary sequence data from the Berkeley production group to discover genes and catalog expression patterns. The team is now analyzing 1.2 Mb from 5q31 that contain the interleukin cluster growth-factor genes.
An ongoing task at the center is to identify and reduce high commercial costs and the number of steps requiring human intervention. Toward these goals, changes include using the ABI Catalyst, with its low-volume pipetting abilities, combined with the ABI 377, which can detect low sample amounts. These changes, along with discounted bulk purchases of disposables over the past 2 years, have reduced supply costs by half. Another important approach is the development of custom automated devices that reduce sequencing labor costs. The group also plans to change from commercial to custom size standards that are cheaper and work better with the imaging station. Operating costs are expected to fall to less than $0.25 per base fairly soon.
Joint Projects with Other DOE Centers
Collaborations between the Berkeley Lab and LANL are expanding. LANL's SASE end-sequencing data will be used to feed Berkeley's path-generation and subclone-sequencing components. Berkeley expects to develop similar collaborations with LLNL for sequencing chromosome 19.
The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v7n6).
The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome.
Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.