Sponsored by the U.S. Department of Energy Human Genome Program

Human Genome News Archive Edition

Human Genome News, January 1992; 3(5)

Workshop on Computational Molecular Biology

The international workshop "Open Problems in Computational Molecular Biology" was held at the Telluride Summer Research Center (Telluride, Colorado) on June 2-8, 1991. Sponsored by the DOE Human Genome Program, the meeting was organized by Andrzej Konopka (National Cancer Institute), Hugo Martinez (University of California, San Francisco), and Peter Salamon (San Diego State University), with Danielle Konings (University of Colorado at Boulder) as events coordinator.

The workshop brought together key researchers in computational biology, coding theory, and biomathematics from nine countries (Canada, China, France, Germany, Netherlands, Israel, Scotland, United States, and the former U.S.S.R.) to address the problem of identifying the kinds of phenomena and principles that constitute biological coding (not only the mRNA protein-translation code).

Sequence-analysis software, which is becoming progressively faster and more powerful, graphical, and user friendly, is now routinely used. New computational architectures are beginning to be implemented for searching databases and comparing sequences.

Many studies in computer-assisted sequence research have been based on the assumption that the genomic code can be compared to a text carrying many messages written in many languages. Although this linguistic analogy was originally meant to be just a metaphor, it has been taken quite literally. An arbitrarily defined pattern in a nucleotide sequence has often been given the rank of a word in an alleged language responsible for an alleged (but often unknown) function. Published sequence-analysis papers are full of references to signals, codes, languages, texts, information, and similar terms that do not refer to any definite concept or phenomenon. As a result, most scientific conclusions of the last 10 years were based on speculation and premature inferences from incomplete evidence.

Use of these arbitrary standards has created a real need for computational biologists to formulate carefully the very foundations of their field; the Telluride workshops are planned as a systematic forum for the exchange of pertinent ideas and results. The 1991 workshop, devoted entirely to the foundations of biolinguistics, dealt with several general topics, and participants reached the following conclusions.

Legitimacy of the linguistic metaphor as a research tool

Conclusions:

  • Detailed methodological guidelines for conducting statistical and heuristic experiments are urgently needed.
  • Computational sequence research logic and terminology should be developed.

Structural patterns and the physiological conditions in which they can be expressed as "signals"

Conclusions:

  • No methodology exists for assessing structural pattern significance to allow systematic consideration of conditions in which patterns are to be expressed.
  • This methodology gap urgently needs to be addressed because a large amount of sequence and structure data will emerge from genome sequencing efforts.

Methods of assessing functional significance without knowing sequence function

Conclusions:

  • A given biological function can be represented as a collection of properly aligned sequences or structures [functionally equivalent sequences (FES)].
  • A set of properties (a profile) can be systematically assigned to a given FES or even to particular FES regions.
  • The vocabulary of profiles assigned to a given FES can be considered a classification code and used for discriminant analysis purposes.
  • No clear methodology exists for deciding whether and to what extent a given classification code corresponds to the actual functional code. The correlation would have to account for overlapping messages, which are inevitable because two or more different functions can be (and often are) encoded in a given nucleic acid sequence.
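
As a rough illustration of the first two conclusions above, the sketch below assigns one simple kind of profile (per-column letter frequencies) to a small set of aligned sequences and scores a candidate against it. The function names, the toy sequences, and the choice of column frequencies as the "property" are assumptions made for illustration only; the workshop report does not prescribe a particular profile or scoring scheme.

```python
from collections import Counter


def position_profile(aligned, alphabet="ACGT"):
    """Per-column relative letter frequencies for equal-length aligned sequences."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned)
        # Symbols outside the alphabet (e.g. gap characters) are ignored.
        total = sum(counts[a] for a in alphabet)
        profile.append({a: (counts[a] / total if total else 0.0) for a in alphabet})
    return profile


def profile_score(profile, candidate):
    """Mean profile frequency of the candidate's letters (between 0 and 1)."""
    if len(candidate) != len(profile):
        raise ValueError("candidate must match the profile length")
    return sum(col.get(ch, 0.0) for col, ch in zip(profile, candidate)) / len(profile)


# Hypothetical, hand-made FES used only to exercise the functions.
fes = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
profile = position_profile(fes)
similar = profile_score(profile, "TATAAT")
dissimilar = profile_score(profile, "GGGGGG")
```

A score of this kind only measures resemblance to the aligned set. As the last conclusion above notes, it does not by itself show that the resulting classification code corresponds to the actual functional code.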

Technical aspects of biomathematics

Conclusions:

  • Statistically, genomic sequences are inhomogeneous, and models that require knowledge of prior distributions of predefined patterns are usually arbitrary (i.e., their selection cannot be justified by sufficient knowledge of the modeled system).
  • Many statistical dependencies in genomic sequences exist at various levels of pattern definition, and most are not detectable by routine statistical approaches.
  • Methods for identifying hidden dependencies need to be developed to design correct statistical models of genomic sequences. Alternatively, statistical techniques not requiring knowledge of prior distributions could be formulated.
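
For concreteness, one routine way to look for a statistical dependency is to estimate the mutual information between symbols a fixed distance apart. The sketch below is an illustrative assumption, not a method proposed at the workshop; the function name and the toy sequences are hypothetical. It captures only pairwise dependencies at a single lag, which is the kind of limitation the conclusions above point to.

```python
import math
from collections import Counter


def lag_mutual_information(seq, lag=1):
    """Estimate mutual information (in bits) between symbols `lag` positions apart."""
    if lag < 1 or lag >= len(seq):
        raise ValueError("lag must be at least 1 and shorter than the sequence")
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # Compare the observed pair frequency with the product of the marginals.
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi


# Toy inputs: a strictly periodic sequence and a single-letter sequence.
periodic_mi = lag_mutual_information("ACGT" * 50, lag=1)
constant_mi = lag_mutual_information("A" * 100, lag=1)
```

In the periodic toy sequence each symbol fully determines its successor, so the estimate is close to 2 bits; in the single-letter sequence there is nothing to predict and the estimate is zero. Estimates from short or compositionally inhomogeneous sequences are biased, so values like these would need a significance assessment before any biological interpretation.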

Coding theory, cryptology, and information theory

Conclusions:

  • To understand sequence data through a linguistic framework, focus is needed on alleged-language pragmatics (i.e., utterance of predefined patterns in a set of known conditions).
  • Complex cellular machinery and insufficient knowledge of its detailed workings prevent consideration of alleged-message syntactic and semantic aspects in genomic sequences at this time.
  • Existing coding theory deals with decoder properties that were designed to serve telecommunication systems and digital computers, a basis with little relevance to systems of unknown design (i.e., to most communication models involving genome expression).
  • A biolinguistics coding theory is required to establish code words of undetermined message units (i.e., words in an unknown language). Before possible theories can be explored, however, databases of significant patterns and associated conditions of their expression must be created.
  • A new, pragmatic information theory is likely to emerge as the body of available sequence and structure data grows.

The workshop promoted formal and informal exchanges of ideas, and sessions were vigorous and lively. Many promising collaborations were initiated, including a nonorthodox application of Kullback entropy to computational molecular biology, a study of the evolution of recombination machinery as a pattern-recognition system, statistical modeling, and the inclusion of thermodynamic properties of sequence fragments in sequence-analysis tasks.


Reported by Andrzej Konopka and Peter Salamon

The electronic form of the newsletter may be cited in the following style:
Human Genome Program, U.S. Department of Energy, Human Genome News (v3n5).

Human Genome Project 1990–2003

The Human Genome Project (HGP) was an international 13-year effort, 1990 to 2003. Primary goals were to discover the complete set of human genes and make them accessible for further biological study, and determine the complete sequence of DNA bases in the human genome.

Human Genome News

Published from 1989 until 2002, this newsletter facilitated HGP communication, helped prevent duplication of research effort, and informed persons interested in genome research.