9th Annual Workshop, October 28-31, 1999
Co-sponsored by the U.S. Department of Energy
Comparison of completed genomes to sample sequences of related genomes
Sandra W. Clifton(1), Michael McClelland(2), Webb Miller(3), William R. Pearson(4), Aaron J. Mackey(4), and Richard K. Wilson(1)
(1) Genome Sequencing Center, Wash Univ School of Medicine, St. Louis, Missouri, USA
(2) SKCC, San Diego, California, USA
(3) Dept Computer Science & Engineering, Penn State Univ, University Park, PA, USA
(4) Dept of Biochemistry, University of Virginia, Charlottesville, Virginia, USA
Complete genomes provide a useful framework for organizing and analyzing partial sequences from related genomes. A sample consisting of 2X or 3X genome equivalents covers over 90% of the genome, and more than 99% of all ORFs over 500 bases in length should be represented by a fragment of at least 100 bases. Information on the presence of shared ORFs and partial identities of unique ORFs can thus be obtained at a fraction of the cost of complete sequencing.
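As a rough check on the coverage figure above, the following is a minimal sketch that assumes the idealized Lander-Waterman (Poisson) model, in which reads start uniformly at random along the genome. This model and the function name are assumptions made for illustration; they are not taken from the abstract, and real shotgun libraries deviate from the idealization.

```python
import math


def expected_covered_fraction(depth: float) -> float:
    """Expected fraction of bases covered by at least one read.

    Assumes read starts are Poisson-distributed along the genome, so the
    chance that a given base is missed at sequencing depth `depth` is
    exp(-depth).
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - math.exp(-depth)


if __name__ == "__main__":
    for depth in (1.0, 2.0, 3.0):
        fraction = expected_covered_fraction(depth)
        print(f"{depth:.0f}X: {fraction:.1%} of bases expected to be covered")
```

Under this idealization the expected covered fraction is about 86% at 2X and about 95% at 3X, so a figure of over 90% corresponds to the upper part of the 2X to 3X range.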
To determine the utility of sample sequences, we have collected data from two enterobacteria: Salmonella paratyphi A (SPA) and a clinical isolate of Klebsiella pneumoniae (KPN). These strains are of interest as human pathogens and for understanding enterobacterial evolution. SPA is very closely related to the completed Salmonella genomes, whereas KPN belongs to a sister clade of Salmonella and Escherichia. Over 10 million bases of raw sequence, representing between 2X and 3X genome equivalents, were collected from both SPA and KPN, which assembled into 4,384 kb and 5,084 kb, respectively.
For Enterobacteriaceae, the E. coli K12 genome (ECO) is completely sequenced [U. Wisconsin], and the genomes of Yersinia pestis (YPE), Salmonella typhi (STY) [Sanger Center] and S. typhimurium LT2 (STM) [Wash. U., http://genome.wustl.edu/gsc/bacterial/salmonella.shtml] are soon to be completed. The ECO sequence has been aligned to the available sequence from each of STM, STY, SPA, KPN, YPE, and Vibrio cholerae. These alignments can be viewed as a "percent identity plot" (PIP), in which the percent identity of each ungapped match is shown on the Y-axis for each pairwise comparison. Deletions in the sampled genomes and the sites of rearrangements and significant insertions are highlighted in color. The alignments can be queried with any named ECO gene, and the corresponding region is displayed in multiple genomes simultaneously. Matching sequences from each aligned genome, covering the reference gene and its flanking regions, are automatically made available.
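To illustrate what a PIP displays, here is a minimal sketch that splits a gapped pairwise alignment into its gap-free blocks and reports the percent identity of each block against reference coordinates. The function name, the "-" gap character, and the toy sequences are assumptions for illustration only; this is not the alignment or plotting software used to produce the PIPs described above.

```python
def ungapped_segments(ref: str, other: str, gap: str = "-"):
    """Yield (ref_start, ref_end, percent_identity) for each gap-free block.

    `ref` and `other` are the two rows of a pairwise alignment and must be
    the same length. Coordinates are 0-based, end-exclusive positions in the
    ungapped reference sequence.
    """
    if len(ref) != len(other):
        raise ValueError("aligned strings must have equal length")
    ref_pos = 0      # position in the ungapped reference
    start = None     # reference start of the current gap-free block
    matches = 0
    length = 0
    for a, b in zip(ref.upper(), other.upper()):
        if a == gap or b == gap:
            # A gap column closes the current block, if one is open.
            if length:
                yield (start, ref_pos, 100.0 * matches / length)
            start, matches, length = None, 0, 0
        else:
            if start is None:
                start = ref_pos
            matches += (a == b)
            length += 1
        if a != gap:
            ref_pos += 1
    if length:
        yield (start, ref_pos, 100.0 * matches / length)


if __name__ == "__main__":
    # Toy alignment, invented for this example.
    reference = "ACGTAC--GTTA"
    sampled = "ACGAACTTG-TA"
    for seg_start, seg_end, identity in ungapped_segments(reference, sampled):
        print(f"ref {seg_start}-{seg_end}: {identity:.1f}% identity")
```

Each tuple corresponds to one horizontal bar in such a plot: its extent along the reference gives the X-axis span and its percent identity gives the Y-axis height.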
Unique portions of the complete and sampled genomes were identified with the FASTX and TFASTX programs. To search for unique regions and potential rearrangements in the sampled genomes, each sampled sequence is compared to the E. coli proteome using FASTX, and the complete E. coli proteome is compared against the partially sequenced genome databases using TFASTX. We present lists of (a) all ORFs found in ECO for which orthologues are apparently absent in the STM, SPA, or KPN samples, and (b) sequences over 400 bp in length that are found in one or more of STM, SPA, or KPN but are absent in ECO. The best homologues of these "unique" regions are determined from other sequence databases, including incomplete genomes deposited at NCBI.
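The bookkeeping behind list (a) can be sketched as follows, assuming the search results have already been reduced to a best E-value per ECO ORF for each sampled genome. The data layout, the ORF names, and the 1e-6 E-value cutoff are hypothetical choices for illustration; the abstract does not state the thresholds or file formats actually used.

```python
def orfs_without_hits(all_orfs, best_evalue_by_sample, max_evalue=1e-6):
    """Return, per sample, the reference ORFs lacking a significant hit.

    all_orfs: iterable of reference ORF identifiers.
    best_evalue_by_sample: {sample: {orf_id: best E-value found in sample}}.
    An ORF counts as apparently absent from a sample when it has no hit at
    all, or when its best E-value is above `max_evalue` (an assumed cutoff).
    """
    report = {}
    for sample, best in best_evalue_by_sample.items():
        report[sample] = sorted(
            orf for orf in all_orfs
            if best.get(orf, float("inf")) > max_evalue
        )
    return report


if __name__ == "__main__":
    # Invented identifiers and E-values, for illustration only.
    eco_orfs = ["orfA", "orfB", "orfC"]
    best_hits = {
        "STM": {"orfA": 1e-40, "orfB": 1e-12},
        "KPN": {"orfA": 1e-30, "orfC": 0.5},
    }
    for sample, missing in orfs_without_hits(eco_orfs, best_hits).items():
        print(sample, missing)
```

Because the sampled genomes are only partially covered, an ORF reported this way is "apparently" absent: the lack of a hit may reflect a gap in the sample and not a true deletion.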