TRANSCRIPTOME 2002: From Functional Genomics to Systems Biology
March 10-13, 2002
Seattle, Washington, USA

Size-Dependent Pareto-like Distributions in Genomics and Prediction of the Number of Protein-coding Genes in Human Genome

Vladimir A. Kuznetsov, National Institute of Child Health and Human Development, NIH, The Laboratory of Integrative and Medical Biophysics, Bethesda, MD

Recently, we described a class of skewed, size-dependent probability distributions that appear in samples from many large-scale gene expression experiments and in proteome and genome data sets /1-3/. The observed distributions share one characteristic: there are few frequent classes and many rare ones. The form of the distribution depends systematically on the size of the sample. We developed a stochastic model of population growth that leads to a size-dependent Pareto-like probability distribution of classes by their frequency of occurrence in a finite multi-class population. In this work, using SAGE data (www.sagenet.org), we statistically identified such a distribution for the transcript abundance values in the human colon cancer transcriptome, represented by ~600,000 SAGE tags. We developed a new computational methodology to remove major experimental errors from the SAGE database and obtained the corrected probability distribution of transcript abundance values in the human transcriptome. A new method to estimate the number of genes was also developed. As conservative estimates, we obtained about 10,500 expressed protein-coding genes in a single human colon cancer cell and ~31,000 expressed protein-coding genes in a population of human cells. Our Pareto-like model also fits the empirical frequency distributions of protein domain occurrence values for distinct protein domains in 10 archaeal, 25 bacterial and 6 eukaryotic proteomes of fully sequenced genomes (www.ebi.ac.uk/interpro/). About 36,500 evolutionarily conserved protein-coding genes were predicted for the entire human genome. This prediction is based on our extrapolation analysis of the relationship between the number of genes in fully sequenced organisms and the number of protein domain occurrences per proteome, together with the numerical characteristics of the Pareto-like distribution of protein domain occurrences in the proteome.

1. V. A. Kuznetsov, R. F. Bonner (1999) Statistical tools for analysis of gene expression distributions with missing data. In: 3rd Annual Conference on Computational Genomics, Nov. 18-21, Baltimore, MD: The Institute for Genomic Research, p. 26.
2. V. A. Kuznetsov (2001) Distribution associated with stochastic processes of gene expression in a single eukaryotic cell. EURASIP J. on Applied Signal Processing, 4, 285-296.
3. V. A. Kuznetsov (2002) Statistics of the numbers of transcripts and protein sequences encoded in the genome. In: Computational and Statistical Methods to Genomics. Kluwer: Dordrecht etc., pp. 125-171.
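
The "few frequent, many rare classes" pattern described in the abstract can be made concrete with a small sketch. The Python below is an illustration only, not the author's method: the function names, the toy tag list, and the functional form p(m) proportional to (m + b)^-(k+1) with parameters k and b are all assumptions introduced here, since the abstract does not give the model's formula. The first function counts how many distinct classes (for example, SAGE tags) occur exactly m times in a sample; the second evaluates a truncated Pareto-like probability function that such counts could be compared against.

```python
from collections import Counter


def occurrence_spectrum(tags):
    """Map m -> number of distinct classes observed exactly m times."""
    per_class = Counter(tags)                 # occurrences of each class
    spectrum = Counter(per_class.values())    # how many classes share each count
    return dict(sorted(spectrum.items()))


def pareto_like_pmf(m_max, k, b):
    """Hypothetical truncated Pareto-like pmf on m = 1..m_max.

    Assumed form (not taken from the abstract): p(m) proportional to
    (m + b) ** -(k + 1), with k > 0 and b > -1, normalised to sum to 1.
    """
    weights = [(m + b) ** -(k + 1) for m in range(1, m_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy sample: one abundant class, one moderate class, four singletons.
sample = ["A"] * 5 + ["B"] * 2 + ["C", "D", "E", "F"]
spectrum = occurrence_spectrum(sample)
pmf = pareto_like_pmf(m_max=5, k=1.0, b=0.5)
print(spectrum)
print([round(p, 3) for p in pmf])
```

In a real analysis the spectrum would be computed at several sample sizes, because the abstract states that the form of the distribution depends on sample size; how k and b would vary with size is not specified there and is left open in this sketch.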



This site produced by the Human Genome Management Information System of Oak Ridge National Laboratory.
