Beyond the Identification of Transcribed Sequences: Functional and Expression Analysis

9th Annual Workshop, October 28-31, 1999

Co-sponsored by the U.S. Department of Energy


Analysis of uncharacterized human cDNAs which encode large proteins in brain: The Kazusa Approach

Osamu Ohara

Kazusa DNA Research Institute, Kisarazu, Chiba, Japan

We have conducted a cDNA sequencing project for the past five years. To date, more than 1200 cDNA sequences have been determined and deposited in the public databases. Since the average size of our cDNAs is about 5 kb, the total number of sequenced nucleotide residues exceeds 6 Mb. These sequence data greatly help in interpreting the human genome sequence and allow us to discover many interesting genes. These achievements are highly important in human genomics, but they are only the early fruits of our project; our cDNA project has also been looking beyond the identification of unknown human transcripts. This is why we have taken a unique approach to human cDNA analysis, i.e., analysis of long cDNAs (>4 kb) which encode large proteins (>50 kDa). This approach clearly distinguishes our cDNA project from others.

We decided to take this approach for the following reasons: (1) long cDNAs are the missing pieces in the human cDNA analyses currently under way elsewhere worldwide; (2) a significant number of large proteins are specific to mammals, if we define proteins larger than 100 kDa as large proteins; (3) in most cases (>75%), the functions of the large proteins in mammals fall into one of three categories: cell structure/motility (e.g., cytoskeletal proteins, membrane skeletal proteins, and motor proteins), cell communication/signaling (e.g., ion channels, receptors, adhesion molecules, and regulators of small G proteins), or nucleic acid management (e.g., transcription factors, RNA binding proteins, and DNA binding proteins); (4) many positionally cloned gene products are large proteins consisting of multiple domains. Because we have been particularly interested in brain functions and the genetic causes of neurological disorders, all these reasons strongly motivated us to undertake the analysis of long cDNAs encoding large proteins.

In my presentation, I will give an overview of the data obtained so far by introducing our database of Human Unidentified Gene-Encoded large proteins (the HUGE protein database). In addition, I would like to describe several new technical developments for the analysis of large cDNAs, because many serious problems have been overlooked or only vaguely anticipated in current cDNA analysis.
