Beyond the Identification of Transcribed Sequences: Functional and Expression Analysis

9th Annual Workshop, October 28-31, 1999

Co-sponsored by the U.S. Department of Energy


Analysis of novel genes from human chromosome 21: determination and characterization of complete protein sequences and examples of overlapping genes

Dobromir Slavov, Roger Lucas, Andrew Fortna and Katheleen Gardiner

Eleanor Roosevelt Institute
Denver, Colorado, USA

We are currently using both computer-based and experimental approaches to identify and characterize novel genes within the genomic sequence of human chromosome 21. Exon prediction, EST database matches and CpG island identification together are highly efficient at demonstrating the presence of a gene within a segment of DNA. Determining the complete structure of a novel gene and verifying its expression, however, is often more challenging, particularly for genes lacking any significant protein similarities. Problems are compounded when low or restricted expression precludes obtaining information from Northern blots or cDNA libraries.
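As an illustration of the CpG island step, the sketch below scores a DNA window by GC content and observed/expected CpG ratio. The thresholds (length of at least 200 bp, GC fraction of at least 0.5, observed/expected of at least 0.6) are widely used defaults and are an assumption here; the abstract does not state which criteria or software were applied, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    # Expected CpG count under independence is (C * G) / N.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to one window of sequence."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction >= min_gc and obs_exp >= min_obs_exp
```

In practice such a test would be run over sliding windows of the genomic sequence; this sketch only scores a single window.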

Our preliminary gene identification is based on consistent exon prediction by at least the Genscan and Grail programs and/or on EST matches that show evidence of exon splicing. CpG island identification and information from RT-PCR and RACE experiments are then added to these data. By these means, we have deduced complete protein sequences for seven novel genes. Protein sizes range from ~250 amino acids to >1500 amino acids. Five proteins contain no discernible protein homologies or motifs; two show only distant homologies that provide no functional clues. None is positive by Northern analysis. Protein sequences have been examined for biochemical and structural features such as amino acid content, hydrophobicity, polarity, and presence of beta sheets and alpha helices. Such data have rarely shown unusual features.
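The preliminary selection rule described above can be sketched as a small filter over exon predictions and EST alignments. This is a minimal sketch under stated assumptions: "consistent" is taken to mean that a Genscan exon overlaps a Grail exon in genomic coordinates, and an EST is treated as spliced when it aligns in two or more genomic blocks. The abstract does not define either term precisely, and the `Interval` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int
    end: int  # inclusive genomic coordinate


def overlaps(a, b):
    """True if two inclusive intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def consistent_exons(genscan, grail):
    """Genscan exons that overlap at least one Grail exon (assumed meaning of 'consistent')."""
    return [e for e in genscan if any(overlaps(e, g) for g in grail)]


def is_spliced(est_blocks):
    """Assumption: an EST aligning in two or more genomic blocks shows evidence of splicing."""
    return len(est_blocks) >= 2


def preliminary_gene_call(genscan, grail, est_alignments):
    """Combine the two evidence types with the and/or rule from the text."""
    shared = consistent_exons(genscan, grail)
    spliced = [blocks for blocks in est_alignments if is_spliced(blocks)]
    return {
        "consistent_exons": shared,
        "spliced_ests": spliced,
        "candidate": bool(shared) or bool(spliced),
    }
```

A region with neither kind of support returns `candidate: False`; CpG island, RT-PCR and RACE data would then be layered on top of the candidates, as the text describes.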

In two other cases, we have evidence of potentially overlapping genes. In both cases, the gene on one strand is represented by consistent exon prediction but no ESTs, and the gene on the opposite strand is represented by one or more ESTs but by no convincing exon predictions. Consensus splice sites are found only on the appropriate strand in all cases. In one case, the exon-prediction gene is located within an intron of the EST gene; in the other case, the EST exons interdigitate with the consistently predicted exons. Expression levels in all cases are low and/or restricted. Analysis of such gene models would be facilitated by corresponding mouse genomic sequence, by adding more coding sequence data to EST sequences, and by more comprehensive information on exon prediction false positive rates.
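The two arrangements described above (one gene nested within an intron of the other, and interdigitated exons) can be distinguished from exon coordinates alone. The sketch below is illustrative only: exons are inclusive `(start, end)` tuples on a shared genomic coordinate system, strand is not modeled, and the function names and labels are hypothetical rather than taken from the authors' analysis.

```python
def introns(exons):
    """Gaps between consecutive exons, as inclusive (start, end) tuples."""
    ex = sorted(exons)
    return [(a_end + 1, b_start - 1) for (_, a_end), (b_start, _) in zip(ex, ex[1:])]


def span(exons):
    """Outermost coordinates covered by a gene model."""
    return min(s for s, _ in exons), max(e for _, e in exons)


def nested_in_intron(inner, outer):
    """True if the whole inner gene lies within a single intron of the outer gene."""
    lo, hi = span(inner)
    return any(i_start <= lo and hi <= i_end for i_start, i_end in introns(outer))


def classify(est_exons, pred_exons):
    """Label the arrangement of an EST gene and an exon-prediction gene."""
    if nested_in_intron(pred_exons, est_exons):
        return "prediction gene within EST-gene intron"
    if nested_in_intron(est_exons, pred_exons):
        return "EST gene within prediction-gene intron"
    est_lo, est_hi = span(est_exons)
    pred_lo, pred_hi = span(pred_exons)
    if est_hi < pred_lo or pred_hi < est_lo:
        return "non-overlapping"
    clash = any(a0 <= b1 and b0 <= a1
                for a0, a1 in est_exons for b0, b1 in pred_exons)
    # Spans overlap, no exon collides, and neither gene fits in one intron:
    # the exons of the two genes must alternate along the sequence.
    return "exons overlap" if clash else "interdigitated"
```

For example, `classify([(100, 200), (900, 1000)], [(300, 400), (500, 600)])` reports the nested case, while `classify([(100, 200), (500, 600)], [(300, 400), (700, 800)])` reports the interdigitated one.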
