U.S., Russian Collaborators Devise Near-Perfect Method to Identify Genes Across Species Lines

Embargoed for Release Tuesday, Aug. 20 (1996?)

A new method can find a human gene if an analogous gene from any other life form is already known.

While other techniques already exist to find cross-species gene analogs, this method is far more accurate. The new method - devised through the collaborative efforts of U.S. and Russian researchers working at the University of Southern California - is therefore likely to find ready applications in biotechnology, evolutionary biology and medical research.

The researchers, two in Russia and one in the United States, describe the method in the Aug. 20 issue of the Proceedings of the National Academy of Sciences.

"Hunting for human genes is a massive, painstaking undertaking that typically takes years and costs tens or even hundreds of millions of dollars," says co-author Pavel A. Pevzner, Ph.D., a professor of mathematics and computer science at USC. "With this method, we can find a human gene if an analogous gene from another species has been identified. The species doesn't matter: mouse, chicken, frog. Anything alive can serve as a template to find human genes."

Many cancer-causing genes already identified in mice and other laboratory animals are thought to have analogs that cause cancer in humans. "All of this animal research can now be translated far more quickly into human gene sequences, and ultimately, we hope, into treatments and cures."

Pevzner and his Russian collaborators - Mikhail S. Gelfand, Ph.D., of the Institute of Protein Research at the Russian Academy, and Andrey A. Mironov, Ph.D., of the Laboratory of Mathematical Methods at the Russian National Center for Biotechnology - have devised a method that overcomes formidable obstacles.

In very simple life forms, such as bacteria, genes are written into the organism's hereditary material as continuous strings of information, recording genetic information in the four-letter base-pair AGCT alphabet of DNA.

In man and other vertebrates, the situation is much less straightforward, even though exactly the same alphabet of letters is used. A human gene, consisting of a message roughly 2,000 letters long, is typically broken into submessages called "exons." These exons are scattered, seemingly at random, across a section of chromosomal DNA as many as a million letters long.

A typical human or mammalian gene can have 10 exons or more. Recently, scientists reported a gene written in 54 separate exons. Another, linked to breast cancer, has 27.

"This situation is comparable to a magazine article that begins on page one, continues on page 13, then takes up again on pages 43, 51, 53, 59, 70, 74, 80 and 91, with pages of advertising and other articles appearing in between," Pevzner explains. "We don't understand why these jumps occur or what purpose they serve. Thankfully, like a magazine, the exons stay in order. They don't jump backward. You always read in the same direction."

The jumps are inconsistent from species to species. An "article" in an insect edition of the genetic magazine will be printed differently than the same article appearing in a worm edition. "The pagination will be completely different," Pevzner explains, "and it will not be consistent: the information that appears on a single page in the human edition may be broken up into two in the wheat version, or vice versa."

The USC scientist continues, noting yet another complication: "The genes themselves, while related, are quite different. The mouse-edition gene is written in mouse language; the human-edition gene in human language. It's a little like German and English, which are related languages: many words are identical or similar, but many others are not. Nevertheless, to find the analogous genes, we must be able to recognize these differently spelled words written on different pages as the same message."

Even there, the complications do not end. If it were just a matter of picking out a known "magazine story," whether in mouse-DNA language or human-DNA language, from intervening material that was obviously advertising, the problem would be far less difficult. Perversely, the "advertising" can mimic the message. Long sequences of "junk DNA," as it's sometimes called, may be identical to parts of the message but not be part of the gene. Such sequences are meant to be skipped when the message is read.
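The skipping described above can be pictured with a short sketch that treats a stretch of chromosomal DNA as a string and a gene as an ordered list of exon intervals. The toy sequence, the coordinates and the function name `splice` are assumptions made up for this illustration, not data from the paper; exons are written in upper case and intervening material in lower case purely for readability.

```python
def splice(genome, exons):
    """Join half-open, 0-based exon intervals in left-to-right order."""
    ordered = sorted(exons)
    # Exons never jump backward or overlap: each one must start at or
    # after the end of the previous one.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("exon intervals overlap")
    return "".join(genome[start:end] for start, end in ordered)


# Toy "chromosome" (hypothetical): three exons separated by filler.
GENOME = "cc" + "ATGAAA" + "gtaagt" + "ACTGCT" + "gtttag" + "TATCTG" + "gg"
EXONS = [(2, 8), (14, 20), (26, 32)]

message = splice(GENOME, EXONS)
print(message)
```

Reading the message is the easy part once the intervals are known. The hard part, which the rest of this release is about, is deciding which intervals are exons in the first place.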

Earlier methods for deciding what is advertising and what is story depended on statistics. To continue the magazine analogy, "it is something like going through back issues of the magazine and finding that human-gene 'stories' are less likely to contain phrases like 'For Sale,' telephone numbers, and the dollar sign," Dr. Gelfand explains.

While better than random reconstruction, these statistical methods are inaccurate at best.

The method developed by Pevzner and his colleagues zeroes in on the proper "pages" by first making a list of all pages that are potentially part of the "story" - that is, all pages that seem to contain sequences belonging to the message.

The software developed by the three researchers then automatically combines and recombines these pages into the set that makes the best fit. The method works best when a "target protein" is available to guide the search. All the stories in the genetic magazine are recipes for making proteins. If you have the protein, you know the way the recipe reads (though you don't know where to find it in the maze of advertising).
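The combine-and-recombine step can be sketched as a search over ordered chains of candidate pages. The version below is a brute-force toy, not the researchers' software: it assumes a handful of candidate blocks whose lengths are multiples of three (so reading-frame bookkeeping can be ignored) and uses plain edit distance against the target protein as the measure of fit. The release does not spell out the scoring or search strategy actually used, so the function names, the toy sequence, the decoy blocks and the target protein are all illustrative assumptions.

```python
from itertools import combinations, product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA to protein, ignoring any trailing partial codon."""
    return "".join(CODON[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def best_chain(genome, blocks, target_protein):
    """Try every ordered, non-overlapping chain of candidate blocks and
    return (distance, chain) for the one whose translation fits best."""
    blocks = sorted(blocks)
    best = (float("inf"), ())
    for k in range(1, len(blocks) + 1):
        for chain in combinations(blocks, k):
            if any(a[1] > b[0] for a, b in zip(chain, chain[1:])):
                continue  # blocks overlap, so this is not a valid chain
            spliced = "".join(genome[s:e] for s, e in chain)
            score = edit_distance(translate(spliced), target_protein)
            best = min(best, (score, chain))
    return best


# Hypothetical data: three true exons plus two decoys inside the filler.
GENOME = ("CC" + "ATGAAA" + "GTAAGTCCCTAG" + "ACTGCT"
          + "GTTTGACAG" + "TATCTG" + "GG")
CANDIDATES = [(2, 8), (10, 16), (20, 26), (27, 33), (35, 41)]
TARGET = "MKSAYL"  # related protein from another species, one residue differs

score, chain = best_chain(GENOME, CANDIDATES, TARGET)
print(score, chain)
```

On this toy input the chain of the three true exons wins with a single mismatch against the target, and every other chain scores worse. Enumerating all chains grows exponentially with the number of candidates, so a practical tool would need a far more efficient search than this one.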

The method's accuracy with such guidance is always good, the scientists report, and often remarkable - 99% or 100% accurate.

The Proceedings paper contains a listing of trials of the method on nearly 100 different genes, 47 of them from mammals (mostly mice), 45 from other organisms, including bacteria.

For mammals, 40 of 47 reconstructions were perfect - 100% accurate. In six of the remaining cases, where the method did not give a perfect prediction, it came close, accurately predicting 94-97%.

Even in the lone case in which the method seemed to fall down - predicting with 75% accuracy on the basis of mouse data - the failure was interesting. In this case, chicken data for the same gene were also available to use for predictions, and the prediction of the human gene from the chicken data was 100% accurate. "This is surprising, given that we think of humans as more closely related to mice than to chickens," Pevzner notes.

Even when the starting point of the reconstruction was target material from organisms evolutionarily extremely different from humans - bacteria, yeasts and others - 25 of the reconstructions were 100% accurate.

"We believe our method will prove extremely useful to researchers, not just in biotechnology, but also evolutionary biology," Pevzner says. "It will enable biologists to trace, with exceptional precision, exact degrees of difference between gene organization in different species. And it will help, we think, to establish evolutionary relationships between species."

The research was supported by grants from the U.S. Department of Energy, the Russian Fund for Fundamental Research, the Russian Human Genome Program, and the National Science Foundation's Young Investigator Program.

(9/5/96)

Dr. Pevzner is a resident of Marina del Rey, Calif. For more information, contact him at ppevzner@hto.usc.edu or (213) 740-2407.
