9th Annual Workshop, October 28-31, 1999
Co-sponsored by the U.S. Department of Energy
Extracting meaningful information from draft sequence
R. J. Mural, F. W. Larimer, M. B. Shah and E. C. Uberbacher
Computational Biosciences Section, Life Sciences Division, Oak Ridge National
Laboratory, Oak Ridge, Tennessee, USA
The plan to complete a 90% draft of the sequence of the human genome by next spring poses special problems for the annotation process. It is clear that 3X to 5X coverage of genomic DNA can yield large amounts of biologically useful information if the appropriate analysis methods are applied. A number of features that can be located and annotated in draft sequence are useful for further analysis. These include STSs, which allow some clones to be assigned to chromosomes or chromosomal locations; BAC end-sequences (STCs), which help to identify neighboring clones and to build framework maps; and ESTs, which can provide gene identification information and, in some cases, map information. As catalogs of full-length cDNAs become available, they will be even more useful than ESTs in helping to define the biological content of draft sequence. These features can be annotated by standard similarity methods, given sufficient computational resources.
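As a rough illustration of why 3X to 5X coverage already captures most of a clone, the sketch below evaluates the idealized Poisson (Lander-Waterman style) estimate of the fraction of bases covered by at least one read. This model is an assumption introduced here for illustration, not a calculation from the work described; real shotgun data deviate from it because of cloning bias and repeats, and the function name is hypothetical.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read.

    Assumes reads land uniformly at random (idealized Poisson model),
    so the chance that a base is missed entirely is exp(-coverage).
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


# Compare the coverage depths discussed in the text.
for depth in (3, 5):
    fraction = expected_fraction_covered(depth)
    print(f"{depth}X coverage: about {fraction:.1%} of bases sampled")
```

Under these assumptions, the uncovered fraction shrinks exponentially with depth, which is consistent with most features being detectable well before a clone is finished.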
Gene identification programs, particularly those that incorporate similarity data, such as Grail-Exp, can provide another level of analysis for draft data. Such analysis not only identifies genes but can also provide some internal ordering information for the sequence contigs that make up the clone being analyzed. Recall also that essentially all of the genes that can be found in finished sequence can be identified in draft sequence at about 3X coverage.
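The internal ordering idea can be sketched as follows: when one cDNA or EST aligns to several unordered contigs of a clone, the positions of those alignments along the transcript suggest the relative order of the contigs. This is a minimal sketch under stated assumptions, not the Grail-Exp method; the `Hit` record, its field names, and `order_contigs_by_transcript` are hypothetical, and the alignments are assumed to have been computed elsewhere.

```python
from __future__ import annotations

from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment between a transcript (cDNA/EST) and a draft contig."""

    contig: str
    query_start: int  # start coordinate on the transcript
    query_end: int    # end coordinate on the transcript


def order_contigs_by_transcript(hits: list[Hit]) -> list[str]:
    """Order contigs by the earliest transcript position each one matches.

    The result is the order along the transcript; it does not determine
    contig orientation or the clone's orientation on the chromosome.
    """
    first_start: dict[str, int] = {}
    for hit in hits:
        start = min(hit.query_start, hit.query_end)
        if hit.contig not in first_start or start < first_start[hit.contig]:
            first_start[hit.contig] = start
    # Tie-break on contig name so the output is deterministic.
    return sorted(first_start, key=lambda name: (first_start[name], name))


# Hypothetical hits of a single transcript against three contigs.
example_hits = [
    Hit("contig3", 900, 1200),
    Hit("contig1", 1, 400),
    Hit("contig2", 401, 899),
    Hit("contig1", 350, 420),
]
print(order_contigs_by_transcript(example_hits))
```

Contigs with no hits to the transcript are left unplaced by this sketch; ordering them would need other evidence, such as further transcripts or clone overlaps.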
We have begun to modify the analysis pipeline that we developed for finished sequence, the results of which can be viewed in the Genome Channel and the Genome Catalog, to provide analysis of draft data. The initial annotation of draft sequence is a catalog of the clone contents (STSs, STCs, and gene models predicted by Grail-Exp and Genscan, as well as BLAST searches of their translations against the NR protein database). Further analysis of this information will help to define relationships among draft clones and will allow ordering within and between clones. To date we have analyzed over 3500 draft clones from human chromosomes 5, 16 and 19, and we are building the data structures to handle other draft data.
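One way to picture how a catalog of clone contents can define relationships among draft clones is to index clones by the features they contain and report pairs that share a feature. The sketch below is illustrative only: the catalog layout, the feature identifiers, and `clones_sharing_features` are hypothetical and do not describe the actual pipeline or its data structures.

```python
from __future__ import annotations

from collections import defaultdict
from itertools import combinations


def clones_sharing_features(
    catalog: dict[str, set[str]],
) -> dict[tuple[str, str], set[str]]:
    """Map each pair of clones to the features (STS, STC, gene) they share."""
    # Invert the catalog: feature identifier -> clones that contain it.
    clones_by_feature: dict[str, set[str]] = defaultdict(set)
    for clone, features in catalog.items():
        for feature in features:
            clones_by_feature[feature].add(clone)

    # Every pair of clones listed under one feature is a candidate neighbor.
    links: dict[tuple[str, str], set[str]] = defaultdict(set)
    for feature, clones in clones_by_feature.items():
        for first, second in combinations(sorted(clones), 2):
            links[(first, second)].add(feature)
    return dict(links)


# Hypothetical catalog entries: clone name -> identifiers of annotated features.
example_catalog = {
    "cloneA": {"STS_1", "STC_7"},
    "cloneB": {"STC_7", "gene_x"},
    "cloneC": {"STS_9"},
}
print(clones_sharing_features(example_catalog))
```

Shared features only suggest candidate overlaps; repeated sequences or gene families can produce spurious links, so such pairs would still need confirmation.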
Supported by the Office of Biological and Environmental Research, US DoE, contract DE-AC05-84OR21400 with Lockheed Martin Energy Systems, Inc.