POMPOUS - Predicted Simple Sequence Repeat Polymorphisms

   The predictions cover likely polymorphic repeats in coding regions and UTRs, identified by inspecting the UniGene database, as well as repeats in introns and exons, identified by analyzing GenBank.

Access our repeat polymorphism prediction database and summary statistics using our
Repeat Polymorphism Search Tool.


Repeat Polymorphisms Within Gene Regions - Phenotypic and Evolutionary Implications

Wren JD, Forgacs E, Fondon JW 3rd, Pertsemlidis A, Cheng S, Gallardo T, Williams RS, Shohet RV, Minna JD, and Garner HR "Repeat Polymorphisms Within Gene Regions: Phenotypic and Evolutionary Implications". American Journal of Human Genetics (August 2000, Vol 67, p. 345-56)

Abstract
 
    We have developed an algorithm that predicted 11,265 potentially polymorphic tandem repeats within transcribed sequences. We estimate that 22% (2,207 out of 9,717) of the annotated clusters within UniGene contain at least one potentially polymorphic locus. Our predictions were tested by allelotyping a panel of ~30 individuals for 5% of these regions, confirming polymorphism for more than half the loci tested. Our study indicates that tandem repeat polymorphisms in genes are more common than generally believed. Roughly 8% of these loci are within coding sequence and, if polymorphic, would result in frame shifts. Our catalog of putative polymorphic repeats within transcribed sequences comprises a large set of potentially phenotypic or disease-causing loci. In addition, from the anomalous character of the repetitive sequences within unannotated clusters, we also conclude that the UniGene cluster count substantially overestimates the number of genes in the human genome. We hypothesize that polymorphisms in repeated sequences occur with some baseline distribution based upon repeat homogeneity, size and sequence composition, and deviations from that distribution are indicative of the nature of selection pressure at that locus. We find evidence of selective maintenance of the ability of some genes to respond very rapidly, perhaps even on intra-generational time scales, to fluctuating selective pressures.

UniGene download date: March 13, 2001

Human
92149 records read
Total genes with repeats: 33200
Where the repeats were found:
5'----787----[1117]-----2382-----3' and 29378 unknown
How many repeats were found:
5'----948----[1437]-----3319-----3' and 39025 unknown

Genes with hairpins found: 6177
Where the hairpins were found:
5'----236----[734]-----173-----3' and 5062 unknown
How many hairpins were found:
5'----330----[926]-----256-----3' and 9950 unknown
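
The positional summary lines in these listings can be read programmatically. The following is a minimal sketch, assuming that the three numbers between 5' and 3' are counts for the 5' UTR, the coding region (in brackets), and the 3' UTR, and that "unknown" counts entries without an annotated coding region. That reading is inferred from the layout and is not documented on this page; the function and key names are hypothetical.

```python
import re

# Assumed layout of one summary line (inferred from the listing, not documented):
#   5'----<5' UTR>----[<coding>]-----<3' UTR>-----3' and <unknown> unknown
REGION_LINE = re.compile(r"5'-+(\d+)-+\[(\d+)\]-+(\d+)-+3' and (\d+) unknown")


def parse_region_counts(line):
    """Split one positional summary line into named integer counts."""
    match = REGION_LINE.search(line)
    if match is None:
        raise ValueError("not a positional summary line: %r" % line)
    utr5, cds, utr3, unknown = (int(value) for value in match.groups())
    return {"utr5": utr5, "cds": cds, "utr3": utr3, "unknown": unknown}


# Example: the human "Where the repeats were found" line.
human_repeats = parse_region_counts(
    "5'----787----[1117]-----2382-----3' and 29378 unknown"
)
```

The regular expression tolerates any number of dashes, since the listing uses runs of four and five.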

Entries with coding regions given: 14966
Average 5'UTR length: 165 (from 2483735 bp)
Average CDS length: 1484 (from 22214384 bp)
Average 3'UTR length: 826 (from 12367305 bp)
Average size of unknown entry: 566 (from 43729132 bp)
Smallest entry=50
Largest entry=17734
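
The human averages above are consistent with dividing each base-pair total by the number of entries and truncating to an integer. The sketch below checks that arithmetic. It assumes that the UTR and CDS averages are taken over the entries with coding regions given, and that the "unknown" average is taken over the remaining records; both are inferences from the numbers, not documented behaviour, and the function name is hypothetical.

```python
def truncated_mean(total_bp, n_entries):
    """Integer mean length: total base pairs divided by entry count, truncated."""
    return total_bp // n_entries


# Human figures copied from the listing above.
records_read = 92149
entries_with_cds = 14966
# Assumption: every record without a coding region counts as an "unknown" entry.
entries_without_cds = records_read - entries_with_cds

avg_utr5 = truncated_mean(2483735, entries_with_cds)
avg_cds = truncated_mean(22214384, entries_with_cds)
avg_utr3 = truncated_mean(12367305, entries_with_cds)
avg_unknown = truncated_mean(43729132, entries_without_cds)

print(avg_utr5, avg_cds, avg_utr3, avg_unknown)  # prints: 165 1484 826 566
```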

Self-Similarity average : 4.684
Self-Complementarity average: 4
3'UTR obs/exp A/T ratio avg.: 0.108
 
 
Mouse
79916 records read
Total genes with repeats: 17857
Where the repeats were found:
5'----320----[508]-----1079-----3' and 16144 unknown
How many repeats were found:
5'----377----[692]-----1607-----3' and 19440 unknown

Genes with hairpins found: 1867
Where the hairpins were found:
5'----87----[302]-----36-----3' and 1455 unknown
How many hairpins were found:
5'----131----[356]-----46-----3' and 2049 unknown

Entries with coding regions given: 7170
Average 5'UTR length: 135 (from 971257 bp)
Average CDS length: 1530 (from 10976156 bp)
Average 3'UTR length: 559 (from 4014279 bp)
Average size of unknown entry: 418 (from 30452300 bp)
Smallest entry=53
Largest entry=17333

Self-Similarity average : 3.94
Self-Complementarity average: 3.639
3'UTR obs/exp A/T ratio avg.: 0.055
 
Rat
46258 records read
Total genes with repeats: 28289
Where the repeats were found:
5'----147----[263]-----569-----3' and 27401 unknown
How many repeats were found:
5'----176----[374]-----838-----3' and 30794 unknown

Genes with hairpins found: 1354
Where the hairpins were found:
5'----48----[161]-----11-----3' and 1138 unknown
How many hairpins were found:
5'----81----[187]-----23-----3' and 1412 unknown

Entries with coding regions given: 4268
Average 5'UTR length: 126 (from 539060 bp)
Average CDS length: 1516 (from 6473174 bp)
Average 3'UTR length: 525 (from 2244223 bp)
Average size of unknown entry: 479 (from 20131942 bp)
Smallest entry=51
Largest entry=16453

Self-Similarity average : 4.069
Self-Complementarity average: 3.772
3'UTR obs/exp A/T ratio avg.: 0.055
 
 
Cow
6789 records read
Total genes with repeats: 2184
Where the repeats were found:
5'----45----[80]-----120-----3' and 1963 unknown
How many repeats were found:
5'----49----[102]-----148-----3' and 2321 unknown

Genes with hairpins found: 236
Where the hairpins were found:
5'----7----[44]-----4-----3' and 181 unknown
How many hairpins were found:
5'----7----[51]-----4-----3' and 338 unknown

Entries with coding regions given: 1321
Average 5'UTR length: 86 (from 114799 bp)
Average CDS length: 1280 (from 1691438 bp)
Average 3'UTR length: 424 (from 561266 bp)
Average size of unknown entry: 487 (from 2666490 bp)
Smallest entry=56
Largest entry=12706

Self-Similarity average : 4.427
Self-Complementarity average: 3.906
3'UTR obs/exp A/T ratio avg.: 0.12
 
 
Zebrafish
10341 records read
Total genes with repeats: 3889
Where the repeats were found:
5'----32----[28]-----135-----3' and 3709 unknown
How many repeats were found:
5'----35----[34]-----192-----3' and 4480 unknown

Genes with hairpins found: 445
Where the hairpins were found:
5'----3----[24]-----5-----3' and 413 unknown
How many hairpins were found:
5'----8----[26]-----7-----3' and 590 unknown

Entries with coding regions given: 823
Average 5'UTR length: 115 (from 95298 bp)
Average CDS length: 1289 (from 1061284 bp)
Average 3'UTR length: 481 (from 396011 bp)
Average size of unknown entry: 517 (from 4929823 bp)
Smallest entry=101
Largest entry=10620

Self-Similarity average : 4.271
Self-Complementarity average: 3.911
3'UTR obs/exp A/T ratio avg.: 0.061
 
 

Computationally Assisted Polymorphic Marker Identification:  Identification and Verification of Multiple New 3p21.3 Polymorphic Markers

J. W. Fondon III, G. M. Mele, D. Cummings, A. Pande, J. Wren, K. M. O’Brien, K. C. Kupfer, M. Lerman, J. D. Minna and H. R. Garner, “Computationally Assisted Polymorphic Marker Identification: Identification and Verification of Multiple New 3p21.3 Polymorphic Markers”, Proc. Natl. Acad. Sci. USA, 95:7514-7519, June 23, 1998.

Abstract

    A computational system for the prediction of polymorphic loci directly and efficiently from human genomic sequence was developed and verified.  A suite of programs, collectively called POMPOUS (POlymorphic Marker Prediction Of Ubiquitous Simple sequences) detects tandem repeats ranging from dinucleotides up to 250-mers, scores them according to predicted level of polymorphism, and designs appropriate flanking primers for PCR amplification.  This approach was validated on an approximately 750 kb region of human chromosome 3p21.3, involved in lung and breast carcinoma homozygous deletions.  Target DNA from 36 paired B lymphoblastoid and lung cancer lines was amplified and allelotyped for 33 loci predicted by POMPOUS to be variable in repeat size.  We found that among these 36 predominately Caucasian individuals 22 of the 33 (67%) predicted loci were polymorphic with an average heterozygosity of 0.42.  Allele loss in this region was found in 27/36 (75%) of the tumor lines using these markers. POMPOUS provides the genetic researcher with a new tool for the rapid and efficient identification of polymorphic markers, and through the creation of a World Wide Web server site, investigators can use POMPOUS to identify new polymorphic markers for their research.  A catalog of 13,261 potential polymorphic markers and associated primer sets has been created from the analysis of 141,779,504 base pairs of human genomic sequence in GenBank.  This data is available on our WWW site and will be periodically updated as GenBank is expanded and algorithm accuracy is improved.
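
To illustrate the kind of detection step the abstract describes, here is a minimal sketch of a perfect tandem repeat scan. It is not the POMPOUS algorithm: it finds only exact repeats with short units, and it does not score predicted polymorphism or design primers. The function name, parameters, and default thresholds are all assumptions chosen for illustration.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for each perfect tandem repeat in seq.

    A repeat is a unit of min_unit..max_unit bases followed immediately by
    at least min_copies - 1 further exact copies of itself.
    """
    seq = seq.upper()
    # Lazy unit match so the shortest repeating unit is preferred;
    # the backreference \1 then consumes as many extra copies as possible.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Example: a (CA)5 dinucleotide repeat starting at offset 3.
hits = find_tandem_repeats("GGTCACACACACATTG")
```

Two limitations are worth noting. Homopolymer runs are reported as repeats of a two-base unit (for example `AA`), and imperfect repeats with interrupting bases are missed entirely. Raising `max_unit` toward the 250-mer limit mentioned in the abstract works in principle, but a backtracking regular expression becomes slow on long sequences.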

We have also catalogued the simple sequence repeats likely to be found in the entire genome.  This includes intronic regions, where the repeats could be used as markers or could be involved in gene regulation.