James W. Fickett
SmithKline Beecham Pharmaceuticals
King of Prussia, PA, USA
In the mapping of disease genes, in organism-wide gene inventories, and in drug discovery, a major challenge is to assign function to newly discovered genes. One important determinant of function is transcriptional context, and a number of experimental techniques are being developed and exploited to inventory the rough expression level of many genes in many particular tissues and disease states.
As genomic sequence accumulates, the analysis of transcriptional regulatory regions in the DNA will increasingly be able to contribute to the characterization of transcriptional context, and perhaps to the fundamental understanding of transcriptional regulation as well. Our goal is to write computer programs that take as input windows of DNA sequence and produce as output the particular conditions (possibly none) under which each window functions as a transcriptional activator or repressor.
Progress will be reported on the development of a prototype algorithm for the recognition of myotube-specific promoters and enhancers. This will include a discussion of a specialized data collection, characterization of the DNA-binding specificity of a number of transcription factors, co-occurrence patterns for multiple transcription factors, and a Bayesian scheme for choosing between the hypotheses of (1) myotube specific regulatory region, (2) other regulatory region, or (3) non-regulatory region.
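The choice among the three hypotheses can be illustrated with a minimal sketch of Bayes' rule. This is not the prototype algorithm itself: the priors and log-likelihoods below are invented numbers, and the function name `posterior` is hypothetical. In practice the likelihoods would presumably come from models of transcription factor binding sites and their co-occurrence.

```python
import math

def posterior(log_likelihoods, priors):
    """Return posterior probabilities for competing hypotheses via Bayes' rule.

    log_likelihoods: log P(window | hypothesis) for each hypothesis
    priors:          P(hypothesis) for each hypothesis (should sum to 1)
    """
    # Work in log space and subtract the maximum for numerical stability.
    log_joint = {h: log_likelihoods[h] + math.log(priors[h]) for h in priors}
    peak = max(log_joint.values())
    unnormalized = {h: math.exp(v - peak) for h, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {h: u / total for h, u in unnormalized.items()}

# Illustrative values only (assumptions, not data from the abstract).
priors = {"myotube": 0.01, "other_regulatory": 0.04, "non_regulatory": 0.95}
log_likelihoods = {"myotube": -10.0, "other_regulatory": -13.0, "non_regulatory": -16.0}

post = posterior(log_likelihoods, priors)
best = max(post, key=post.get)  # hypothesis with the highest posterior
```

Because most windows of genomic DNA are non-regulatory, the prior for that hypothesis is large, so a window is only called myotube-specific when its likelihood under that model is much higher than under the alternatives.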
Maria de Fatima Bonaldo(1), Greg Lennon(2) and Marcelo
(1)College of Physicians and Surgeons of Columbia University, New York, USA (2)The New York State Psychiatric Institute, New York, USA and (3)Lawrence Livermore National Laboratory, Berkeley, CA, USA
Large-scale single-pass sequencing of cDNA clones randomly picked from libraries has proven quite powerful for identifying genes, and the use of normalized libraries, in which the frequency of all cDNAs is within a narrow range, has been shown to expedite the process by minimizing the redundant identification of the most prevalent mRNAs. However, given the large scale of the ongoing sequencing efforts and the fact that a significant fraction of the human genes has been identified already, the discovery of novel cDNAs is becoming increasingly challenging. In an effort to expedite this process as we strive towards the ultimate goal of identifying the majority of the human genes, we have developed and applied subtractive hybridization strategies to eliminate pools of sequenced cDNAs from libraries yet to be surveyed. Briefly, single-stranded DNA obtained from pools of arrayed and sequenced I.M.A.G.E. clones is used as a template for PCR amplification of cDNA inserts with flanking T7 and T3 primers. PCR amplification products are then used as drivers in hybridizations with normalized libraries in the form of single-stranded circles. The remaining single-stranded circles (subtracted library) are purified by hydroxyapatite chromatography, converted to double-stranded circles and electroporated into bacteria. Preliminary characterization of a subtracted fetal liver-spleen library indicates that the procedure is effective in enhancing the representation of novel cDNAs.
Towards the development of libraries enriched for full-length cDNAs
Kala Mayur(1), Maria de Fatima Bonaldo(1) and Marcelo
(1)College of Physicians and Surgeons of Columbia University, New York, USA (2)The New York State Psychiatric Institute, New York, USA
It is anticipated that the existence of libraries enriched for full-length cDNAs will greatly facilitate positional cloning projects aimed at the identification of disease genes. In an effort towards this goal we developed and tested an approach for construction of libraries enriched for full-length cDNAs which involves: (i) pooling of mRNAs from different sources, (ii) size fractionation of the mRNAs, (iii) cDNA synthesis from each individual RNA size fraction, (iv) size selection of the resulting cDNAs according to the size fraction of the starting mRNAs, and (v) separate cloning and propagation of each individual sub-library. In a feasibility study conducted to assess the effectiveness of our procedure, we have generated 5 sub-libraries from a mixture of size-fractionated mRNAs from human infant brain and placenta. Preliminary characterization of these libraries indicates that they contain cDNAs ranging in size from 0.35 kb up to 10 kb. Sequence analysis suggests that over 50% of the clones in each sub-library are either full-length or near full-length. Work is in progress to develop similar resources from a more comprehensive pool of mRNAs.
J.G. Sutcliffe(1), P.E. Foye(1), L. de Lecea(1), M. Carson(1), E. Thomas(1), B. Hilbush(2), K.W. Hasel(2)
(1)Dept. of Molecular Biology, The Scripps Research Institute, La Jolla, CA, USA (2)Digital Gene Technologies, Inc., La Jolla, CA, USA
We report a method called TOGA (TOtal Gene expression Analysis) that utilizes an 8-nucleotide sequence near the 3' ends of mRNAs, composed of a 4-nucleotide restriction cleavage site and an adjacent 4-nucleotide parsing sequence, to give nearly every mRNA in the organism a distinct identity, regardless of whether the mRNA has been discovered previously. The position of the restriction cleavage site defines its distance from the 3' poly(A) tail for each mRNA. All 256 possible sequences of the adjacent 4 nucleotides are used one at a time as part of primer-binding sites in PCR-based assays performed on mRNA samples from tissue extracts to parse the expressed mRNAs into 256 pools, each of which is separated by electrophoresis for visualization of its product constituents. The 8-nucleotide sequence and measured length together provide a nearly unique digital address for each mRNA species. Data collected from manual and automated assays demonstrate that the appearance of a band generated by TOGA indicates the presence in the sample of an mRNA at a concentration proportional to band intensity. TOGA is reproducible, detects mRNAs of less than 0.001% prevalence, and with four iterations identifies and quantitates greater than 99% of the mRNAs expressed in a sample. Thus TOGA represents a powerful means for comparing mRNA expression profiles.
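The digital address described above can be sketched in a few lines. This is an in silico illustration only, not the laboratory protocol. The recognition site `CCGG`, the choice of the 3'-most site, the placement of the parsing sequence immediately 3' of the site, and the function names are all assumptions made for the example; the abstract does not specify them.

```python
from itertools import product

SITE = "CCGG"  # assumed 4-nt restriction site; the abstract does not name the enzyme

# The 4^4 = 256 possible 4-nt parsing sequences, one per PCR pool.
PARSING_SEQUENCES = ["".join(p) for p in product("ACGT", repeat=4)]

def digital_address(mrna, site=SITE):
    """Return (8-nt tag, distance from site to 3' end), or None if no usable site.

    Assumes the 3'-most occurrence of the site is the relevant one and that
    the parsing sequence lies immediately 3' of it.
    """
    pos = mrna.rfind(site)
    if pos == -1 or pos + 8 > len(mrna):
        return None
    tag = mrna[pos:pos + 8]          # restriction site + parsing sequence
    distance = len(mrna) - pos       # stands in for the measured product length
    return tag, distance

def pool_index(tag):
    """Index (0-255) of the pool selected by the parsing half of the tag."""
    return PARSING_SEQUENCES.index(tag[4:])

example = "AAACCGGTTACGGGAAAAAAAA"   # toy sequence, not a real transcript
address = digital_address(example)
```

Two mRNAs collide only if they share both the 8-nucleotide tag and the same distance to the 3' end, which is why the pair is described as a nearly unique address.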
University of Calgary, Dept of Medical Genetics, Calgary, Alberta
Several genes have been cloned and mapped to the short arm of the human X chromosome between ZNF21 and DXS255 at Xp11.23-p11.22. These include the erythroid-specific transcription factor GATA1, the transcription factor for the enhancer E3 (TFE3), synaptophysin (SYP), and the genes for Wiskott-Aldrich syndrome (WASP), X-linked thrombocytopenia (WASP), Dent disease (CLCN5), and one form of synovial sarcoma. In addition, several new genes and expressed sequence tagged sites (ESTs) have been identified in this region, including MG21, MG44, MG61, MG81, DXS1011E, DXS7469E, RNPL, IS2, IS3, IS4, and IS7. The genes for both X-linked congenital stationary night blindness (CSNB1) and Åland Island eye disease (AIED) have been mapped to this region of Xp11. In an effort to positionally clone these two disease genes we have generated an extensive transcript map incorporating all known genes and ESTs, as well as 44 new ESTs, and embedded it in a detailed physical map of this region.
We have previously constructed a 2 Mb physical contig in Xp11.23-p11.22 consisting of yeast artificial chromosomes (YACs) and cosmid clones (Boycott et al., 1996). The contig was constructed from the content of 43 DNA markers (including 20 new sequence tagged sites (STSs) and 12 genes or expressed sequences) for 99 YACs and 130 cosmids. The contig spans the distal marker ZNF21 and extends proximally to include DXS255, with an average STS marker density of one every 50 kb. We have also mapped an additional five published ESTs on the physical contig. The relative order of previously known genes and ESTs in this region is predicted to be Xpter - ZNF21 - IS7 - DXS7465E (MG66) - DXS7927E (MG81) - RNPL, IS3, IS4, WASP, DXS1011E, DXS7465E (MG21) - DXS7466E (MG44) - GATA1 - IS2 - DXS7469E (Xp664) - TFE3 - SYP (DXS1007E) - CLCN5 - Xcen.
A minimal set of YACs and cosmids was used for the isolation of transcribed sequences by direct cDNA selection of retinal, brain, fetal brain, and placental cDNA sets. We have found direct selection to be 75% efficient and have mapped 44 new ESTs to the physical map of the region between ZNF21 and DXS255. We are currently isolating full-length cDNAs in an effort to collapse the 44 ESTs into a limited set of genes. A detailed map incorporating physical, transcript, and genetic information will be valuable for identifying disease-causing genes that map to this region of the X chromosome.
This work has been funded by the RP Research and ID Bebensee Foundations. KMB is the recipient of the RP Research Foundation Cook-McCann Studentship.
Boycott, K.M., Halley, G., Schlessinger, D., and Bech-Hansen, N.T. (1996). Genomics 33(3): 488.
Lucy R. Osborne(1,2), Stephen W. Scherer(1,2), Duane Martindale(3), Johanna Rommens(1,2), Ben Koop(3), and
(1)Department of Genetics, The Hospital for Sick Children, Toronto; (2)Department of Molecular and Medical Genetics, The University of Toronto, Ontario; (3)University of Victoria, British Columbia, Canada.
Williams syndrome (WS) is a multi-system developmental disorder caused by the deletion of contiguous genes at 7q11.23. Hemizygosity of the elastin (ELN) gene can account for the vascular and connective tissue abnormalities observed in WS patients, but the genes that contribute to features such as infantile hypercalcemia, dysmorphic facies, and mental retardation remain to be identified. Identification of all genes within the deleted region will be necessary for a full understanding of the underlying etiology of WS. As part of our human chromosome 7 mapping project, we are constructing a physical map of the WS region using a combination of cosmids, YACs, and PACs. Although well-characterized YAC contigs flanking the WS region have been established, YACs in the WS commonly deleted interval appear to be highly unstable. We have thus far isolated overlapping 'sequence-ready' cosmid and PAC clones covering approximately 1 Mb of DNA from within the deleted region.
Our goal is to generate a transcription map of the entire deleted region (estimated to be 1.5 Mb) using a combination of direct genomic sequencing and cDNA selection. A comprehensive description of genes localized to this region should be useful in understanding the genotype-phenotype correlation of this syndrome. We initially tested this combination of gene identification strategies on 500 kb of DNA from within the WS deletion. Nine cosmids representing the minimal tiling path of 500 kb of DNA were individually sheared and subcloned into M13 sequencing vectors, and 150-200 random clones were picked and sequenced using fluorescence-labeled primers. The sequencing data were assembled into contigs and searched for open reading frames and matches to known genes and ESTs. In addition, the same cosmid tile was used for cDNA selection using a variety of cDNA pools, and the retrieved clones were sequenced and analyzed for homology to known genes or ESTs.
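The open reading frame search applied to the assembled contigs can be illustrated with a minimal six-frame scan. This is a sketch under simple assumptions (ATG start, standard stop codons, no splicing) and is not the software used in this work; the function names and the `min_codons` parameter are hypothetical. Human genes are interrupted by introns, so a scan like this finds candidate exonic stretches rather than complete genes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end, orf) for ATG..stop ORFs in six frames.

    Coordinates are 0-based, end-exclusive, and refer to the strand scanned
    (for "-" they index the reverse-complemented sequence).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                   # close it at the in-frame stop
    return orfs

hits = find_orfs("CCATGAAATTTTAGGG")  # toy sequence
```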
cDNA selection experiments revealed three putative genes within the 500 kb region studied. None of these transcription units had been identified previously, although one, which showed homology to the RRM/RNP group of RNA-binding proteins, had been isolated as an EST (HUMORF D26068). The other two units showed no similarity to any known gene but were used to isolate longer clones from a fetal brain cDNA library. Genomic sequencing isolated seven genes from within the region, including the RNA-binding protein. Three of these genes had been cloned previously (ELN, LIMK, and RFC2), one had homology to restin and two others corresponded to ESTs. All genes contained identifiable exons. In addition, a number of ESTs were identified whose translation products showed most similarity to prokaryotic proteins. These were most likely due to E. coli contamination of the cosmid sublibraries used for sequencing.
While it will be important to extend our contig to cover the entire commonly deleted region in WS in order to obtain a complete description of the genes for further analysis, the direct DNA sequencing strategy combined with searching for ESTs present in public databases appears to be particularly effective. It may become more robust than the direct cDNA selection procedure for gene identification as the number of EST entries continues to increase. It is clear, however, that no one technique will identify all the genes within a given part of the genome. In light of these experiments we are now extending both the cDNA selection and genomic sequencing strategies to include gene identification in the remaining interval of DNA from within the common WS deletion.
I. Expression patterns. Gene identification efforts on human chromosome 21 have been effective in isolating partial cDNAs for a significant percentage of the genes contained within it. Because the original biological motivation for this work was an understanding of the Down Syndrome phenotype, the next step is to determine the developmental expression patterns and the functional roles of these genes. To determine expression patterns, 230 non-overlapping, non-repetitive cDNAs have been arrayed and screened with mRNA from 4 stages of fetal pig (brain and body) and 2 ages of mouse brain. This has defined a set of cDNAs expressed in developing brain and a smaller subset that shows potentially interesting differential expression patterns. In a continuing search for sequence homologies that may assist in functional determinations, all cDNAs are periodically used in BLAST searches. Recent results indicate that a significant proportion of even the non-rare messages are still not represented either by functional matches or in the EST databases.
II. Isochore boundaries. The mammalian genome is a mosaic of regions of varying base composition (the isochores). Previous transcriptional mapping efforts have made clear that the gene distribution on 21q correlates well with the patterns in base composition. Recently, Fukagawa et al. cloned the transition between an AT-rich region and a GC-rich region on human chromosome 6, and showed that the segment had high homology to the pseudoautosomal boundary (PAB) of the sex chromosomes. They termed this sequence PABL (PAB-like) and showed that similar sequences occur in many copies in the human genome. Because there are many boundaries between regions of differing base composition on chromosome 21, we screened a chromosome 21 cosmid library with a PABL sequence. We have identified numerous PABL-containing clones, a subset of which have been sequenced and placed on the physical map. Interestingly, some appear to be expressed and their locations correspond well with predictions from the base compositional map. Potential structural, functional, or organizational roles for such sequences will be of interest to explore.
Ian N. Hampson, Lynne Hampson and T. Michael Dexter
Paterson Institute of Cancer Research, Department of Experimental Haematology, Christie Hospital, Manchester, United Kingdom
We originally developed the method of Chemical Cross Linking Subtraction (CCLS) (1; see also the abstract by Joanne M. Walter et al.). The main limitation of this approach is the availability of mRNA. We now describe a rapid method of global PCR amplification of cDNA such that the strand sense is maintained. The products of this process are random-primed fragments, which facilitate uniform PCR amplification of total cDNA. Directional incorporation of an RNA polymerase initiator/promoter sequence allows efficient synthesis of total sense RNA from this material, and the use of a biotinylated primer permits the separation of single-stranded cDNA. Isolation of these products from different cell types provides a renewable source of single-stranded target cDNA and driver RNA from limited cell numbers, and we demonstrate their use for CCLS subtractive cloning of differentially expressed cDNAs.
References
1. Hampson I.N., Hampson L. (previously Pope), Cowling G.J. and Dexter T.M., Nucl. Acids Res. 20, p. 2899, 1992
Sally H. Cross, Victoria H. Clark and Adrian P. Bird
Institute of Cell and Molecular Biology, Edinburgh University, Edinburgh, Scotland
One of the major challenges in gene discovery, when using a positional cloning approach, is the identification of transcripts in genomic clones. Once the region in which a gene of interest resides has been narrowed down to a few kilobases, a variety of techniques are employed to identify genes, for example exon trapping. An approach which we are developing involves the isolation of intact CpG islands from genomic clones, which can then be used as probes to pull out full-length transcripts from cDNA libraries. CpG islands are short genomic regions of about 1 kb which are GC-rich and unmethylated. There are about 45,000 CpG islands in the human genome, which mark and overlap the 5' end of 60% of genes. Using a methyl-CpG binding column we have exploited the unusual base composition of CpG islands in order to purify them from both genomic and cloned DNAs. To date we have constructed whole-genome CpG island libraries from human, mouse and chicken. In addition we have single-chromosome CpG island libraries from human chromosomes 22 and 18. One important feature of CpG island libraries is that the representation of each CpG island is independent of the expression pattern of the associated gene. Furthermore, CpG islands are usually single-copy and therefore make ideal probes. We have also isolated CpG islands from a human cosmid clone and are currently applying the technique to a human PAC clone.
CpG islands are important genomic regions because they contain both promoter and transcribed parts of genes. Because of their high GC content they are likely to be the regions which are most difficult to sequence in the human genome sequencing project. Therefore, although CpG islands are not associated with all genes, their isolation provides an important resource for genome projects.
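The unusual base composition of CpG islands can also be examined computationally once sequence is available. The sketch below is an in silico complement to, not a description of, the column-based purification reported here. It computes the GC fraction and the observed/expected CpG ratio for a sequence; the function name is hypothetical, and the thresholds mentioned in the comment are a commonly used heuristic rather than values taken from this abstract.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence.

    Expected CpG count is (number of C * number of G) / length, i.e. the
    count expected if C and G occurred independently at each position.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")            # CpG dinucleotides on this strand
    gc_fraction = (c + g) / n
    expected = c * g / n
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio

# A commonly used heuristic (an assumption here, not from the abstract) calls a
# window island-like if GC fraction > 0.5 and observed/expected CpG > 0.6.
island_like = cpg_stats("CGCGCGCG")
at_rich = cpg_stats("ATATATAT")
```

Bulk vertebrate DNA is depleted in CpG, so a ratio near or above the expected value is what makes islands stand out.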
Steven J.M. Jones and Richard Durbin
The Sanger Centre, Wellcome Trust Genome Campus, Cambridge, United Kingdom
The C. elegans genomic sequencing project has so far derived over 46 Mb of sequence from the 100 Mb genome. Currently over 7,300 protein-coding genes and over 330 tRNA genes have been predicted. The total number of protein-coding genes within C. elegans is estimated to be in the range of 14,000-15,000. Approximately 50% of the predicted proteins show similarity to previously determined protein sequences in the public databases.
Protein-coding genes are initially predicted using GENE FINDER (P. Green and L. Hillier, unpublished). GENE FINDER utilizes known codon biases in C. elegans coding regions, splice sites, and start and stop signals to identify putative exons, and dynamically attempts to predict a set of non-overlapping genes for a given sequence. Hexamer (or diamino) frequencies are now also being used to identify putative protein-coding sequences.
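The use of hexamer frequencies to flag putative coding sequence can be illustrated with a simple log-odds score. This is a generic sketch and makes no claim about how GENE FINDER itself is implemented; the function names, the pseudocount, and the toy training sequences are all assumptions for the example.

```python
import math
from collections import Counter

def hexamer_counts(seqs, step):
    """Count 6-mers in each sequence, advancing `step` bases at a time."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts

def make_scorer(coding, noncoding, pseudocount=1.0):
    """Build a log-odds function: log P(hexamer | coding) / P(hexamer | noncoding)."""
    cod = hexamer_counts(coding, 3)      # in-frame hexamers (two adjacent codons)
    non = hexamer_counts(noncoding, 1)   # all positions in noncoding DNA
    cod_total = sum(cod.values()) + pseudocount * 4 ** 6
    non_total = sum(non.values()) + pseudocount * 4 ** 6

    def score(hexamer):
        p_cod = (cod[hexamer] + pseudocount) / cod_total
        p_non = (non[hexamer] + pseudocount) / non_total
        return math.log(p_cod / p_non)

    return score

def coding_score(seq, score, frame=0):
    """Sum hexamer log-odds over a window read in the given frame."""
    return sum(score(seq[i:i + 6]) for i in range(frame, len(seq) - 5, 3))

# Toy training data (assumptions); real tables would be built from known genes.
score = make_scorer(["ATGGCTGCTGCTGCTGCT"], ["ATATATATATATATATAT"])
likely_coding = coding_score("GCTGCTGCTGCT", score)
likely_noncoding = coding_score("ATATATATATAT", score)
```

A positive total suggests the window resembles the coding training set in that frame; a negative total suggests it resembles the noncoding set.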
The initial GENE FINDER predictions are consolidated with C. elegans EST and protein homology data. BLASTN (Altschul et al. 1990) in conjunction with MSPcrunch (Sonnhammer and Durbin 1994) is used to map ESTs to the genomic sequence and confirm exonic sequences. ESTs also represent the only way that alternate splice variants can currently be detected in previously uncharacterised genes. Currently around 31% of the predicted genes possess one or more EST sequences. Exons are also detected by comparing the C. elegans genomic sequence to previously determined protein sequences using BLASTX and MSPcrunch. tRNA genes are predicted using the TRNASCAN method of Fichant and Burks (1991).
The database ACEDB (R. Durbin and J. Thierry-Mieg, unpublished) provides the environment within which all the gene appraisal, editing and annotation is carried out.
Altschul et al. (1990) J. Mol. Biol. 215:403-410. Fichant G.A. and Burks C. (1991) J. Mol. Biol. 220:659-671. Sonnhammer E.L. and Durbin R. (1994) Comput. Appl. Biosci. 10(3):301-307.