Workshop on Complete cDNA Sequencing

Gaithersburg, MD
May 19, 1997


In the fall of 1996, several research teams worldwide announced plans for complete sequencing of cDNAs, the sturdy representatives of the cell's fragile messenger RNAs (mRNAs) for gene expression. A workshop for participants in the international Integrated Molecular Analysis of Genome Expression (I.M.A.G.E.) consortium [HGN 6(6), 3] was called to extend the highly beneficial infrastructure I.M.A.G.E. has provided since 1994 to the challenges of complete cDNA sequencing. The meeting was organized and chaired by Greg Lennon of the Lawrence Livermore National Laboratory (LLNL), with Marvin Stodolsky coordinating for the meeting sponsor, the DOE Office of Biological and Environmental Research. Scientists attended from France, Germany, Italy, Japan, Sweden, the United Kingdom, and the United States.

Several participants are members of the subgroup EURO-IMAGE, whose goals include (1) generating and sequencing a master set of unique full-length cDNA clones (based on I.M.A.G.E. consortium resources) representing 3000 transcripts and 6 Mb of finished sequence; (2) obtaining high-resolution and comparative functional mapping, in human and model organisms, of 1000 genes from the master set; and (3) developing the I.M.A.G.E. consortium database for easy access to an integrated view of the sequence, map, and expression data generated.

Agencies providing funds for cDNA efforts were represented at the workshop and include DOE, NIH, and the recently established non-profit Merck Genome Research Institute (MGRI) [HGN 8(3-4), 9]. Selected highlights of technical progress in complete cDNA sequencing, as reported by several workshop participants, follow.

Workshop Highlights
Attendees addressed a wide range of topics, including the following:

Speakers projected that, with adequate support from funding agencies, participating laboratories could generate up to 15,000 full-length cDNA sequences in the coming year. With average cDNA lengths of 2 kb, this represents some 30 Mb of total sequence.
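The projection above is simple arithmetic, and the short Python sketch below only reproduces it. The function and variable names are illustrative; the inputs are the figures quoted in the text.

```python
# Reproduce the throughput estimate quoted above.
# Inputs come from the text; the names are illustrative only.
CDNAS_PER_YEAR = 15_000    # projected full-length cDNA sequences
AVERAGE_LENGTH_KB = 2      # average cDNA length, in kilobases


def total_sequence_mb(n_cdnas, avg_length_kb):
    """Return total sequence in megabases (1 Mb = 1000 kb)."""
    return n_cdnas * avg_length_kb / 1000


print(total_sequence_mb(CDNAS_PER_YEAR, AVERAGE_LENGTH_KB))  # 30.0
```

The same calculation applied to the EURO-IMAGE goal of 3000 transcripts and 6 Mb of finished sequence implies the same 2-kb average.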

Technical Progress
It has long been recognized that expression of a single gene may culminate in production of several different mRNA transcripts, depending both on the gene and the source tissue. Added to this biological complexity are the technical challenges of converting mRNAs to the sturdier cDNAs. Libraries with abundant truncated products are the common result, particularly for longer source mRNAs. Strategies devised for alleviating this truncation problem were described by Takao Isogai (Helix Research Institute, Japan), Nobuo Nomura (Kazusa DNA Research Institute, Japan), John Quackenbush (The Institute for Genomic Research, TIGR), and M. Bento Soares (University of Iowa).

A greater proportion of full-length cDNA products can be obtained through the use of two tactics. One protocol type takes advantage of the unusual nucleotide "cap" on the 5' end of mRNAs. In this approach, production of the final cDNA clone is contingent on the first cDNA strand extending far enough to protect the cap. However, Soares reported that about one-third of the cDNA transcripts begin within the mRNA, as contrasted with the desired starts at the 3' end of the mRNA, giving rise to 3' truncations. The second tactic alleviates this problem substantially: the mRNAs are size fractionated, and cDNA products with lengths equal to the size-sorted mRNA templates are selected later.
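The size-selection idea can be illustrated with a small Python sketch. It is a simplified model, not a laboratory protocol: in practice the selection is done physically on size-fractionated material, and the function name, the length bounds, and the example lengths below are all hypothetical.

```python
def select_size_matched(cdna_lengths, fraction_min, fraction_max):
    """Keep cDNA lengths that fall inside the size range of the mRNA fraction.

    cdna_lengths: lengths (in bases) of cDNA products from one mRNA fraction.
    fraction_min, fraction_max: assumed bounds (in bases) of that mRNA fraction.
    Products shorter than the fraction are treated as truncated and dropped.
    """
    return [n for n in cdna_lengths if fraction_min <= n <= fraction_max]


# Hypothetical example: an mRNA fraction of roughly 2.0 to 2.5 kb.
products = [800, 1500, 2100, 2400, 2450]
print(select_size_matched(products, 2000, 2500))  # [2100, 2400, 2450]
```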

Hans Lehrach (Max Planck Institut für Molekulare Genetik, Germany) related the value of massively parallel oligomer fingerprinting of cDNAs. This is an economical way to screen a library for novel and longer, potentially full-length cDNAs. Optimal candidate cDNAs chosen by the Lehrach team at the Resource Center of the German Genome Project are being sequenced in the laboratory of Annemarie Poustka, Deutsches Krebsforschungszentrum.

More than one sequencing read is commonly necessary to display the complete sequence for cDNAs longer than a few hundred bases. Strategies for economical full-length sequencing were discussed by Greg Lennon and Richard Gibbs (Baylor College of Medicine). Sequence reads beyond 1000 bases now are being obtained with improvements to sequencing systems by a team led by Wilhelm Ansorge (European Molecular Biology Laboratory). Ansorge suggested that, for cDNAs shorter than 2 kb, good coverage could be achieved by two overlapping reads on complementary strands.
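The reasoning behind the two-read suggestion can be sketched in a few lines of Python. This is a minimal model that assumes one read is started from each end of the insert, on opposite strands, and that every base of each read is usable; the function name and the example lengths are illustrative.

```python
def end_read_overlap(cdna_length, read_length):
    """Overlap, in bases, between two reads started from opposite ends.

    A positive result means the two reads meet and overlap in the middle;
    a negative result is the size of the uncovered gap between them.
    """
    usable = min(read_length, cdna_length)  # a read cannot run past the insert
    return 2 * usable - cdna_length


print(end_read_overlap(1800, 1000))  # 200
print(end_read_overlap(2500, 1000))  # -500
```

Under these assumptions, reads of just over 1000 bases always overlap on a cDNA shorter than 2 kb, while a longer clone leaves an internal gap that would need additional reads.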

Giuseppe Borsani (Telethon Institute of Genetics and Medicine (TIGEM), Italy) reported on the benefits of the easily manipulated Drosophila model for studies of development and function to reveal roles represented by human cDNAs.

Mark Boguski (National Center for Biotechnology Information, NCBI) reported on the status of the dbEST cDNA sequence database and the Transcript Map, and made recommendations for the evolution needed under the impending new demands of complete cDNA sequencing. He observed that each group will have its own selection criteria and sequencing priorities such as finding cancer genes, genes with Drosophila homologs, or genes that already have been mapped. Boguski coined the expression "the slicing problem" to describe the difficulties in avoiding undesirable duplication and redundancy due to overlapping choice categories.

A possible solution would be to establish a registration and tracking database modeled after the successful European Bioinformatics Institute RHAlloc/RHdb approach used in the construction of the human transcript map. Such a database would include an investigator or center name and contact information, identifiers for the physical cDNA clones being sequenced and associated EST accession numbers, and sequencing status. When participants registered a clone that they intended to sequence, the database would detect and report overlaps with clones selected by other groups.
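A registry of this kind could be sketched as follows in Python. This is an illustration of the idea only, not the RHAlloc/RHdb schema or any database that was actually built: the class names, field names, and the rule that two registrations overlap when they share a clone identifier or an EST accession are all assumptions, and the identifiers in the usage example are placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Registration:
    """One group's declared intent to sequence a clone (hypothetical fields)."""
    center: str                      # investigator or center name
    contact: str                     # contact information
    clone_id: str                    # identifier of the physical cDNA clone
    est_accessions: FrozenSet[str]   # associated EST accession numbers
    status: str = "planned"          # sequencing status


class CloneRegistry:
    """In-memory registry that reports overlaps at registration time."""

    def __init__(self) -> None:
        self._records: List[Registration] = []

    def register(self, new: Registration) -> List[Registration]:
        """Store `new` and return earlier records from other centers that
        share its clone identifier or at least one EST accession."""
        conflicts = [
            rec for rec in self._records
            if rec.center != new.center
            and (rec.clone_id == new.clone_id
                 or rec.est_accessions & new.est_accessions)
        ]
        self._records.append(new)
        return conflicts


# Usage with placeholder identifiers.
registry = CloneRegistry()
first = Registration("Center A", "a@example.org", "CLONE-0001", frozenset({"EST-1"}))
second = Registration("Center B", "b@example.org", "CLONE-0002", frozenset({"EST-1", "EST-2"}))
registry.register(first)                 # no earlier records, so no overlaps
overlaps = registry.register(second)     # shares EST-1 with Center A's clone
print([rec.center for rec in overlaps])  # ['Center A']
```

Reporting overlaps rather than rejecting the registration leaves the decision to the groups involved, which matches the coordinating role described above.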

Toward the close of the workshop, attendees agreed that I.M.A.G.E. meetings should take place every six months to maintain necessary coordination and efficiency. The next meeting, organized by John Quackenbush, was held in September in conjunction with the Ninth International Genome Sequencing and Analysis Conference in Hilton Head, South Carolina. Washington University I.M.A.G.E. participants will organize the following workshop, tentatively planned to be continuous with the May 1998 Human Genome Workshop at the Cold Spring Harbor Laboratory.

Participants in the Workshop on Complete cDNA Sequencing

Front, left to right: Takao Isogai, Wilhelm Ansorge, Marvin Stodolsky, Giuseppe Borsani, Annemarie Poustka, Kirsten Timms, Marcelo Bento Soares, M.J. Finley Austin, Elise Feingold, Hans Lehrach.
Back, left to right: Greg Lennon, Charles Auffray, Nobuo Nomura, Richard Gibbs, Michael Metzker, Stephan Wiemann, John Quackenbush, Mark Yandell, Mark Boguski, Cleo Naranjo, Joakim Lundeberg, Chris Mundy.
Not pictured: Carol Dahl, Marvin Frazier, Aristides Patrinos, Robert Strausberg.


cDNAs represent mRNAs
When a gene is expressed, chromosomal DNA is transcribed into nuclear RNA molecules containing long non-coding introns separating the short protein-coding segments. A much shorter mRNA with a contiguous protein-coding segment matures as the introns are naturally processed out within the nucleus. Processing of a single nuclear RNA into a few different mature mRNAs is common, and depends on both the gene and the tissue or organ in which it is expressed. As a population, mRNAs range in length from a few hundred to a few thousand bases, depending on both the source gene and the nuclear RNA processing. mRNAs are exported to the ribosomal complexes of the cytoplasm, where they act as templates for protein synthesis. mRNAs can be short lived naturally and are also highly susceptible to degradative processes during harvesting from cells.

For analytical purposes, mRNAs can be worked up into the much sturdier libraries of double-stranded cDNAs. Using poly dT as a primer on the 3' poly A end of purified mRNAs, reverse transcriptase enzymes of viral origin polymerize the synthesis of a single-stranded DNA complement of the mRNA. These initial DNA transcripts often fail to extend to the 5' end of longer mRNAs. Using more routine biochemistries, the single-stranded DNA is converted into duplex DNA and combined with a DNA vector to support its propagation and maintenance as a DNA clone. Double-stranded DNAs are much more stable and less susceptible to degradative processes than their single-stranded mRNA predecessors.
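At the sequence level, the first-strand reaction described above can be modeled with a short Python sketch. It is a toy model that assumes priming at the very 3' end of the mRNA and a fixed number of template bases copied; the function name, the `extension` parameter, and the example sequence are invented for illustration.

```python
# Base pairing between an RNA template and the DNA strand copied from it.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna, extension=None):
    """Return the first-strand cDNA (written 5'->3') copied from `mrna` (5'->3').

    Synthesis starts at the 3' poly A end of the template. `extension` is the
    number of template bases copied before the enzyme stops; None means the
    strand reaches the 5' end of the mRNA.
    """
    n = len(mrna) if extension is None else min(extension, len(mrna))
    copied = mrna[len(mrna) - n:]  # 3' portion of the template actually copied
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(copied))


toy_mrna = "AUGGCCUAAAAAA"  # invented sequence ending in a short poly A run
print(first_strand_cdna(toy_mrna))               # TTTTTTAGGCCAT
print(first_strand_cdna(toy_mrna, extension=9))  # TTTTTTAGG
```

The second call shows the truncation problem: when synthesis stops early, the product lacks the complement of the 5' end of the mRNA.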

All cDNA clones derived from a particular tissue constitute a library of clones representing the genes that were expressed when the source tissue was harvested. The analysis of libraries from many different tissues, and obtained under a variety of physiological conditions, will be necessary to decipher the organ-specific patterns of gene expression.


The need for coordination and infrastructure
In a cDNA library the numerical representation of particular cDNAs varies over a thousand-fold. The predominant members of cDNA libraries from all tissues are the genes for cellular maintenance functions. I.M.A.G.E.'s coordination minimizes the unwanted and expensive repetitive analysis of the already characterized cDNAs.

A single sequencing read of a few hundred cDNA bases is usually sufficient to serve as a distinguishing identifier (called an expressed sequence tag, or EST) of the predecessor mRNA. This approach was pioneered by Craig Venter, now director of TIGR, which made public a major EST data release in June 1997. High-throughput production of ESTs from I.M.A.G.E. cDNAs has been predominantly funded by Merck & Co., with sequencing done at the Washington University (St. Louis) Genome Center Human EST Project. The ESTs are deposited in the public database dbEST, which supports queries on the similarities of ESTs and cDNAs to cDNA molecules whose analysis is just beginning.

The I.M.A.G.E. consortium manages distribution of reference sets of cDNAs. In the U.S., libraries of cDNA clones representing many different tissues are donated to I.M.A.G.E. at LLNL, where the clones are placed into reference arrays; replicas are then provided to genome research centers and private sector resource distributors. Over 3 million clone replicas have been sent to over 1000 laboratories worldwide. End users return data on the analyzed clones. All data are entered into public databases, enabling researchers to compare them against their preliminary cDNA sequencing data and eliminate redundant efforts.

EST analyses of over 500,000 cDNA I.M.A.G.E. clones suggest that over 50,000 of the estimated 60,000 to 80,000 human genes are represented. I.M.A.G.E. researchers at LLNL are providing "subtracting cDNA reagents" to aid the production of new cDNA libraries (by M. Bento Soares) that preferentially contain clones not already represented in the current I.M.A.G.E. collection.


Support in the USA
Several funding agencies provide support for cDNA-related projects. To annotate developing chromosome maps, DOE in 1990 began dedicated support for improved cDNA library production, early EST generation by C. Venter's team, physical mapping of cDNAs onto chromosomes, and database support. High throughput correlations of cDNAs with the new BAC resources are also in progress. The sequencing of cDNAs corresponding to genes recognized during genomic sequencing is often a component of major chromosome sequencing projects.

The NIH National Human Genome Research Institute is also supporting research and development in cDNA library improvement and mouse cDNA library production. In the NIH-supported chromosome map development using radiation hybrid methodologies, about one-third of the markers are derived from ESTs. The source genes are thus mapped onto the chromosomes. Recently the NIH National Cancer Institute (NCI) began providing substantial support for cDNA library production and analysis in a major effort to identify cancer-related genes [see article, HGN 8(3-4), 8]. This effort, called the Cancer Genome Anatomy Project, CGAP, was described at the workshop by Carol Dahl and Robert Strausberg of the NCI.

Merck grant administrator M.J. Finley Austin spoke of supporting programs to characterize cDNAs representing disease genes; these programs include full-length cDNA cloning and sequencing.

The utility of the mouse model for studying human diseases is being advanced with diverse collaborative support, including NIH for library construction and DOE for I.M.A.G.E. efforts at LLNL to array mouse cDNA libraries. Washington University (St. Louis) generates mouse ESTs for the clone arrays with support from the Howard Hughes Medical Institute.


