9th Annual Workshop, October 28-31, 1999
Co-sponsored by the U.S. Department of Energy
Mining the yeast genome expression and sequence data
EMBL Outstation -- Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
The rapid increase in the amount and complexity of bioinformatics data is creating new challenges in transforming these data into knowledge, and is also opening new possibilities for purely in silico studies of genome function. The first genome-scale gene expression data have recently started to become available, in addition to complete genome sequences and annotations. Among the first such public data sets are the experiments by DeRisi et al. (Science, Vol. 278, 1997) on gene expression across the complete yeast genome during the diauxic shift. We have used these data in combination with genome sequence and annotation data from the MIPS database as a case study of data mining in bioinformatics. Among other approaches, we used a clustering algorithm based on discretizing the expression measurement space of the time series to group potentially coregulated genes. For each cluster, we extracted the genomic sequences upstream of its genes and used a specifically designed sequence pattern discovery algorithm, SPEXS, to look for patterns common to the cluster. The algorithm discovered sequence patterns that are potential transcription factor binding sites and can be expected to participate in the regulation of the diauxic shift. For details see "Predicting Gene Regulatory Elements in Silico on a Genomic Scale" (A. Brazma, I. Jonassen, J. Vilo, E. Ukkonen), Genome Research, Vol. 8, Issue 11, 1202-1215, November 1998. One general conclusion from this research is that with the existing public gene expression data we can mostly "rediscover" and explain previously known facts, whereas finer data appear to be needed for real in silico discoveries. Therefore, the European Bioinformatics Institute is currently looking into the feasibility of establishing a public repository of DNA microarray-based gene expression data. We are interested in discussing the opinions of people involved in using these technologies.
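The two analysis steps described above (clustering genes by discretized expression profiles, then searching each cluster's upstream sequences for shared patterns) can be illustrated with a minimal Python sketch. It is not the implementation used in the study and it is not SPEXS: the pattern search here is a naive enumeration of fixed-length substrings, swapped in only for illustration. The function names, the log2-ratio threshold of 1.0, the substring length, and the toy gene names and sequences are all assumptions made for this example.

```python
from collections import defaultdict


def discretize(profile, threshold=1.0):
    """Map each log2 expression ratio to '+', '-' or '0' (assumed threshold)."""
    return "".join(
        "+" if x >= threshold else "-" if x <= -threshold else "0"
        for x in profile
    )


def cluster_by_discretized_profile(expression, threshold=1.0):
    """Group genes whose discretized time series are identical."""
    clusters = defaultdict(list)
    for gene, profile in expression.items():
        clusters[discretize(profile, threshold)].append(gene)
    return dict(clusters)


def common_kmers(sequences, k=6, min_fraction=1.0):
    """Return substrings of length k present in at least min_fraction of the sequences.

    A naive stand-in for a real pattern discovery algorithm such as SPEXS.
    """
    counts = defaultdict(int)
    for seq in sequences:
        # Count each substring at most once per sequence.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            counts[kmer] += 1
    needed = min_fraction * len(sequences)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)


# Toy data, invented for illustration only.
expression = {
    "geneA": [0.1, 0.2, 1.5, 2.0],
    "geneB": [-0.2, 0.3, 1.2, 2.4],
    "geneC": [0.0, -1.3, -2.0, -2.2],
}
upstream = {
    "geneA": "TTACCCCTGA",
    "geneB": "GGACCCCTTT",
    "geneC": "AAAAAAAAAA",
}

clusters = cluster_by_discretized_profile(expression)
for pattern, genes in clusters.items():
    shared = common_kmers([upstream[g] for g in genes], k=5)
    print(pattern, genes, shared)
```

Grouping by exact match of the discretized string keeps the sketch short; a real analysis would also have to decide how to treat near-identical profiles and how to judge whether a shared pattern is more frequent in a cluster than in the genome as a whole.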