Searching for the Protein Coding Genes in the Human Genome Sequence

Jean Weissenbach
91057 Evry Cedex, France
telephone: 33 1 60 87 25 02
fax: 33 1 60 87 25 32
prestype: Platform
presenter: Jean Weissenbach

Jean Weissenbach, Hugues Roest Crollius, Olivier Jaillon, Alain Bernot, Lucie Friedlander, Abel Ureta Vidal, Gabor Gyapay, William Saurin
Genoscope and CNRS-FRE 2231, Evry, France

Despite the availability of most of the human genome sequence, identifying genes on the DNA sequence remains a difficult task. We have built a search tool (called Exofish, for Exon FInding by Sequence Homology) that combines a specific setting of TBLASTX, output filtering, and a collection of random DNA sequence reads that at present represents a third of the genome of the pufferfish Tetraodon nigroviridis (closely related to Fugu). Exofish detects sequence matches in 2/3 of the genes in several sets of human genes, with a background of false positive matches below 1%.
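The output filtering step can be pictured with a minimal Python sketch. It is an illustration only, not the Exofish implementation: it assumes the TBLASTX matches are available in the standard 12-column BLAST tabular format, and the thresholds (MIN_IDENTITY, MIN_LENGTH) and function names are hypothetical, since the actual Exofish settings are not given here. Matches that pass the thresholds are merged into non-overlapping regions on each human query sequence.

```python
from collections import defaultdict

# Hypothetical thresholds for illustration only; these are NOT the Exofish settings.
MIN_IDENTITY = 70.0   # percent identity of the translated alignment
MIN_LENGTH = 20       # alignment length, in amino acids


def parse_hits(lines):
    """Yield (query_id, start, end, identity, length) from BLAST tabular lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qstart, qend = int(fields[6]), int(fields[7])
        # Hits in reverse frames are reported with qstart > qend; normalise them.
        yield (fields[0], min(qstart, qend), max(qstart, qend),
               float(fields[2]), int(fields[3]))


def conserved_regions(lines, min_identity=MIN_IDENTITY, min_length=MIN_LENGTH):
    """Filter hits by identity and length, then merge overlapping query spans."""
    by_query = defaultdict(list)
    for qid, start, end, identity, length in parse_hits(lines):
        if identity >= min_identity and length >= min_length:
            by_query[qid].append((start, end))

    regions = {}
    for qid, spans in by_query.items():
        spans.sort()
        merged = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= merged[-1][1]:
                # Overlaps the previous region: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        regions[qid] = [tuple(span) for span in merged]
    return regions
```

Calling conserved_regions on the lines of a tabular TBLASTX report returns, for each human sequence, a list of (start, end) coordinates of candidate coding regions. Merging matters because several pufferfish reads can match the same human exon.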

Exofish has been applied successively to the December 1999 and June 2000 versions of the "working draft" of the human genome. The latter analysis puts the number of protein coding genes at around 27,000-28,000, somewhat below our earlier estimates of 28,000-34,000.

Exofish analysis of the Unigene set of human ESTs indicates that about 50% of the coding fraction of the human genome is still missing from the public sequence databanks.

About 15% of the exons detected by Exofish on human chromosome 22 fall outside annotated genes. A more detailed analysis of this annotation using new full-length cDNA sequences suggests, however, that (1) most of the Exofish-detected exons falling outside annotations actually belong to annotated genes, (2) many of the annotated genes are not yet accurately delimited, and (3) a number of these genes will merge together.
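A figure such as the 15% above comes from comparing detected exon coordinates with annotated gene coordinates. The short Python sketch below shows one way to compute such a fraction. It is an assumption-laden illustration: the function name, the closed-interval convention, and the rule that any overlap counts as "inside" are hypothetical choices, not the procedure used for the chromosome 22 analysis.

```python
def fraction_outside(exons, genes):
    """Return the fraction of exon intervals that overlap no gene interval.

    Both arguments are lists of (start, end) pairs on the same sequence,
    with closed coordinates (an assumption made for this sketch).
    """
    if not exons:
        return 0.0
    outside = 0
    for e_start, e_end in exons:
        # Two closed intervals overlap when each starts before the other ends.
        overlaps = any(e_start <= g_end and g_start <= e_end
                       for g_start, g_end in genes)
        if not overlaps:
            outside += 1
    return outside / len(exons)
```

Under this rule, extending a gene's boundaries or merging two neighbouring genes into one larger interval can only lower the fraction, which is consistent with points (1) to (3).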

All these observations indicate that a useful annotation of the human genome sequence still requires larger sets of additional sequence data (cDNAs, related genomes) for comparison. In addition, since every sequence analysis method has its limitations, it is essential to rely on a panel of tools that is as diverse as possible.

