9th Annual Workshop, October 28-31, 1999
Co-sponsored by the U.S. Department of Energy
Genomic sequence analysis and novel gene characterization on human chromosome 21
Katheleen Gardiner, Dobromir Slavov, Roger Lucas, Andrew Fortna and Alla Rynditch
Eleanor Roosevelt Institute, Denver, Colorado, USA
Sequencing centers in Japan and Germany anticipate completing the entire sequence of human chromosome 21 by early 2000. Currently, >25 Mb of the total of 40 Mb have been deposited in public databases. We are analyzing this sequence for gene and repeat sequence content and for notable features of genome organization. Novel genes and gene models discovered in this analysis are also being examined for expression patterns and alternative processing. At present, we have detailed information on >7.5 Mb, deriving from various isochore classes and including ~1 Mb with a GC level >50%. Results in the following areas will be presented:
i) Gene identification: 45 genes have been found that either have previously described protein sequences or represent new homologies or new members of gene families. For each of these, the number and size of all introns and exons (coding and noncoding) have been determined. Associations with CpG islands have also been assessed. A further 45 novel gene models are predicted based on exon predictions, CpG island identification and EST matches. In addition to models composed of excellent exon predictions associated with dbEST matches, we also find excellent exon predictions lacking ESTs and EST matches lacking good exon predictions. This suggests that a fourth class of novel genes may be lurking in the human genome: those lacking both ESTs and reliable exon predictions.
ii) Genome organization: More than 3/4 of "known" genes are associated with CpG islands, and more than 1/2 of novel models are likely to be. Within 6.5 Mb, 14 of 41 known genes contain 22 introns >20 kb in size. In total, these large introns comprise >1 Mb, more than 15% of the DNA analyzed. In contrast to the large amount of intronic DNA, there are numerous examples of short intergenic distances. Of 24 well-defined intergenic distances, 16 are less than 5 kb and 8 are less than 2 kb. In addition, two examples of overlapping/interdigitated genes are predicted.
iii) Expression analysis: Of 19 gene models tested, all but one were negative by Northern analysis. RT-PCR has been used to define expression in 11 tissues, including 8- and 10-week fetus and adult brain regions. The majority of genes show restricted expression and interesting patterns of alternative processing. Of seven models for which complete protein sequences have been deduced, five have no discernible protein homologies and two show only weak homologies.
In spite of significant successes, these analyses serve to emphasize several problems associated with attempts to interpret human genomic sequence information. These problems and potential solutions will also be discussed.
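
The GC level and CpG island measures referred to above can be illustrated with a minimal sketch. The abstract does not state which CpG island criteria or software were used, so the thresholds below (length of at least 200 bp, GC fraction above 0.5, observed/expected CpG ratio above 0.6) are an assumption based on the commonly cited Gardiner-Garden and Frommer definition, and the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def looks_like_cpg_island(seq: str,
                          min_len: int = 200,
                          min_gc: float = 0.5,
                          min_obs_exp: float = 0.6) -> bool:
    """Apply the assumed thresholds to a single candidate window."""
    return (len(seq) >= min_len
            and gc_fraction(seq) > min_gc
            and cpg_obs_exp(seq) > min_obs_exp)


if __name__ == "__main__":
    # Toy sequences for illustration only, not chromosome 21 data.
    print(looks_like_cpg_island("CG" * 100))
    print(looks_like_cpg_island("AT" * 100))
```

In practice such a test would be applied to sliding windows along the genomic sequence, and the window size and thresholds would need to match whatever definition the analysis actually adopted.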