Exceptional Chromosome Regions II



Challenges to Human Genome Sequencing: Not so Cryptic Duplications and the Genomic Abyss

Julie R. Korenberg, Xiao-Ning Chen, Pranay Bhattacharyya
Departments of Human Genetics and Pediatrics, UCLA and Cedars Sinai, Medical Genetics Birth Defects Center, Los Angeles, CA, 90048
And Collaborators*

Like doughnuts, the holes in the genome are not found in the final product. Most of the gaps in the finished sequence are due to the instability of genomic regions containing repeated sequences. This instability results in underrepresentation in recombinant libraries, misaligned clone maps and difficulties in sequencing. The underlying sequence structure of such regions ranges from clustered (centromeres, pericentromeres, telomeres) or interspersed reiterated simple sequences to megabase-sized, highly conserved regions of genomic DNA. Such regions are important to sequence in order to elucidate their roles in cellular function, human genomic variability, germ line and somatic instability, and gene regulation. Therefore, in recognition of the unavoidable biases that such unstable regions introduced in sequence sets and their descendants, a set of BACs was defined at random and integrated with draft sequence to provide anchor points for sequencing centromeres, pericentromeres and duplications in chromosome arms. Clones likely to fill unsequenced gaps in chromosomes 5, 16 and 19 are defined below.

Clones from Exceptional Regions were defined as follows: Spanning the human genome, a total of about 6,000 BACs were mapped by in situ hybridization: 3,500 defined at random, 184 from screens using consensus alpha satellite, 346 from screens using telomeric oligonucleotides and about 2,000 from other screens of the Caltech BAC libraries A and B. STS linkage was established for about 957 clones, end sequences for 272 and fingerprints for 976.

Centromeric Regions: A total of 373 BACs mapped to centromeric regions (~40 Mb), of which 192 mapped to single centromeres (~20 Mb), 150 mapped to multiple centromeres and 31 (~3.4 Mb) to all human centromeres (defined as universals). For chromosome 5, 43 BACs are centromeric, of which one is specific, five map to both 5 and 19, seven map to multiple chromosomes and 30 are universals. For chromosome 16, in addition to the universals, one BAC is specific, two map to three sites and six map to six or more sites. For chromosome 19, there are no specific BACs; five map to chromosomes 5 and 19, and five to four or more sites.
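The centromeric tallies above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts and megabase estimates quoted in the text; the dictionary labels and variable names are illustrative, and the "Mb per BAC" figure is simply the ratio of the quoted numbers, not a measured insert size.

```python
# Counts quoted in the paragraph above; the dictionary keys are illustrative labels.
genome_wide = {"single centromere": 192, "multiple centromeres": 150, "universal": 31}
chr5 = {"specific": 1, "chromosomes 5 and 19": 5, "multiple chromosomes": 7, "universal": 30}

genome_total = sum(genome_wide.values())  # compare with the quoted total of 373
chr5_total = sum(chr5.values())           # compare with the quoted total of 43

# Approximate span per BAC implied by the quoted megabase estimates.
mb_per_bac = {
    "all centromeric": 40 / 373,
    "single centromere": 20 / 192,
    "universal": 3.4 / 31,
}

print(genome_total, chr5_total)
for label, value in mb_per_bac.items():
    print(f"{label}: {value:.2f} Mb per BAC")
```

The three ratios come out close to one another, which suggests the megabase estimates were derived from a roughly constant assumed span per clone.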

Chromosome Arm FISH Multisite Clones: A majority were defined at random.

Fingerprint Data:
Of the total mapped BACs, 990 revealed multiple sites, of which 350 (of 3,500) had been defined at random, suggesting that a minimum of 10% of the genome was duplicated in addition to the known regions. Of the 990 multisite BACs, 489 were fingerprinted, of which 33 with 5-29 bands had no database match and 58 had too few bands, suggesting that a minimum of 8% of duplications (noncentromeric) were not represented in the fingerprint (FP) database. For chromosome 5, eight of 56 BACs (14%) having at least one FISH site on chromosome 5 were not represented in the FP database and three were on orphan contigs. For chromosome 16, 13 of 51 BACs (25%) were not represented in the FP database, four were located on orphan contigs and three at the ends of contigs. For chromosome 19, six of 22 (27%) were not represented in the FP database and one mapped to an orphan contig. For chromosomes 5, 16 and 19, a minimum of 35 BACs from the current dataset, or 50 from the complete multisite set, may contribute to filling gaps in the draft sequence.
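The percentages in the fingerprint paragraph follow directly from the quoted counts, as the sketch below shows. The helper `percent` and all variable names are illustrative. The final sum is one reading that reproduces the quoted minimum of 35 BACs, under the assumption that BACs absent from the FP database and BACs on orphan contigs are counted separately and that BACs at contig ends are excluded; the text does not state this explicitly.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# Random clones that gave multiple FISH sites.
random_multisite_pct = percent(350, 3500)

# Per chromosome: (BACs absent from the FP database, BACs with a FISH site there).
absent = {"5": (8, 56), "16": (13, 51), "19": (6, 22)}
# Per chromosome: BACs located on orphan contigs.
orphan = {"5": 3, "16": 4, "19": 1}

absent_pct = {chrom: percent(a, n) for chrom, (a, n) in absent.items()}

# Assumption: candidates = absent from FP database + on orphan contigs.
candidates = sum(a for a, _ in absent.values()) + sum(orphan.values())

print(random_multisite_pct, absent_pct, candidates)
```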

End Sequence Data based on 272 of 718 multisite BACs (2/2001)
Two hundred and seventy-two multisite BACs had at least one end sequence available (446 further remain to be analyzed). Of these, 160 had at least one database hit of greater than 80% homology: 95 (60%) were above 98% homology and 65 (40%) had hits of 80-98%. Three were located on orphan contigs. This suggested that at least 40% of the multisite clones detected repeated regions which were not included in the draft sequence. One hundred and twelve of the multisite BACs, or 41%, had no match in the draft sequence.
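As a consistency check on the end-sequence figures, the sketch below confirms that the quoted categories partition the 272 analyzed clones and recomputes the fractions to one decimal place. Variable names are illustrative; the text reports the same fractions rounded to 60%, 40% and 41%.

```python
# Counts quoted in the end-sequence paragraph above.
analyzed = 272          # multisite BACs with at least one end sequence
remaining = 446         # multisite BACs still to be analyzed
with_hit = 160          # at least one database hit above 80%
hit_above_98 = 95
hit_80_to_98 = 65
no_match = 112

# The categories should partition the analyzed set and the full multisite set.
partition_ok = (with_hit + no_match == analyzed) and (hit_above_98 + hit_80_to_98 == with_hit)
multisite_total = analyzed + remaining  # compare with the 718 in the heading

fractions = {
    "hits above 98% (of hits)": 100 * hit_above_98 / with_hit,
    "hits of 80-98% (of hits)": 100 * hit_80_to_98 / with_hit,
    "no match (of analyzed)": 100 * no_match / analyzed,
}

print(partition_ok, multisite_total)
for label, value in fractions.items():
    print(f"{label}: {value:.1f}%")
```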

Analyses (5.23.01) for Chromosomes 5, 16 and 19
For chromosome 5, of 25 end-sequenced multisite BACs, two had no hits (8%), 16 had hits of 98% or above (two on chromosome 5, and four of which had multiple sites) and seven had hits below 98%, with two mapping on orphan contigs. This suggested that 13 of the original 25 represented repeated families for which not all clones had been sequenced. For chromosome 16, three of 26 multisite BACs had no hits in draft sequence. Of the 23 hits, 14 were to multiple sites (less than 98%) and nine were to single sites. This suggested that 12% (3/26) defined unsequenced repeats and 61% (14/23) defined repeats for which not all members had been sequenced, some on chromosome 16. For chromosome 19, two of eight multisite BACs had perfect hits and the remaining six hits were less than 98% or on multiple chromosomes, suggesting that 75% represented repeat families for which the original BAC had not been sequenced. This results in a minimum of 36 clones in the current dataset, and 70 predicted for the complete set of multisite BACs, that may yield sequence information for chromosomes 5, 16 and 19.
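The per-chromosome breakdown can be tabulated and checked in the same way. In the sketch below the category labels are paraphrases of the text, and the quoted totals are included only for comparison; it verifies that each chromosome's categories add up to the number of end-sequenced BACs and recomputes the quoted percentages.

```python
# Per-chromosome categories of end-sequenced multisite BACs, as quoted above.
end_seq = {
    "5": {"no hit": 2, "hit at 98% or above": 16, "hit below 98%": 7},
    "16": {"no hit": 3, "multiple sites": 14, "single site": 9},
    "19": {"perfect hit": 2, "below 98% or multiple chromosomes": 6},
}
quoted_totals = {"5": 25, "16": 26, "19": 8}

# Sum the categories for each chromosome and compare with the quoted totals.
totals = {chrom: sum(cats.values()) for chrom, cats in end_seq.items()}

quoted_percentages = {
    "chr5 no hit": round(100 * 2 / 25),
    "chr16 no hit": round(100 * 3 / 26),
    "chr16 multiple sites (of hits)": round(100 * 14 / 23),
    "chr19 imperfect hits": round(100 * 6 / 8),
}

print(totals == quoted_totals)
print(quoted_percentages)
```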

In summary, we have defined a subset of BAC reagents for duplicated regions in the genome, a number of which are neither mapped nor sequenced. End sequencing the remaining 446 multisite clones may provide defined reagents that together may help to cover a total of ~11 Mb of regions related to or duplicated on chromosomes 5, 16 and 19. The clones defined in the current report as not present in the draft sequence may derive both from the random approach to clone selection and from the use of an alternative library. Such end-sequenced BAC clones that are not included in the genome draft sequence may provide one cost-efficient source of BACs for filling gaps, for defining hotspots of genomic instability and for sequencing centromeric regions containing genes.

*SY Zhao (TIGR, Rockville, MD)
*M. Sekhon and J. McPherson (Washington Univ Genome Sequencing Center, St Louis, Missouri)
*H. Shizuya and M. Simon (Caltech, Pasadena, CA)

Last modified: Wednesday, October 22, 2003


Site sponsored by the U.S. Department of Energy Office of Science, Office of Biological and Environmental Research, Human Genome Program