SNIDE - cSNp IDEntification using DNA sequence context only

     

    Horvath MM, Fondon JW III, and Garner HR "Low hanging fruit: A subset of human cSNPs is both highly nonuniform and predictable". Gene 2003. 312:197-206.

    Abstract:
        We present a point mutation classification method that contrasts SNP databases and has the potential to illuminate the relative mutational load of genes caused by codon bias. We group point variation gleaned from public databases by their wild-type and mutant codons, e.g. codon mutation classes (CMCs, 576 possible such as ACG→ATG), whose frequencies in a database are assembled into a BLOSUM-style matrix describing the likelihood of observing all possible single base codon changes as tuned by the intertwined effects of mutation rate and selection. The rankings of the CMCs in any database are reshuffled according to the population stratification of the typical genotyping experiment producing that resource's data. Analysis of four independent databases reveals that a considerable fraction of mutation in functional genes can be described by a few CMCs regardless of gene identity or population stratification in the genotyping experiment. For example, the top 5% (29/576) of CMCs account for 27.4% of the observed variants in dbSNP while the bottom 5% account for only 0.02%. For nonsynonymous disease-causing mutation, 40.8% are described by the top 5% of all possible nonsilent CMCs (22/438). Overall, the most observed polymorphism is a G→A transition at CpG dinucleotides causing ACG, TCG, GCG, and CCG to frequently undergo silent mutation in any gene due to the putative nonimpact on the protein product. In order to assess how well CMC spectrums estimate the aggregate nonsynonymous mutational trends of a single gene, a CMC matrix was applied to seven unrelated genes to compute the most likely point mutations. In excess of 87% of these mutation predictions are historically known to play an important role in a disease state according to published literature. CMC-based mutation prediction may aid design and execution of direct association genotyping studies.

     

     

    Supplementary data to the manuscript
    Supplementary data table 1 - Codon first position G/A and third position C/T transitions disproportionately occur within codon-bridging CpG dinucleotides.
       
     
    Database-specific codon mutation class matrices
    For each cSNP dataset, mutations were categorized by their codon mutation class (CMC), which is the trinucleotide sequence context of the allele in terms of its wild-type and mutant codons (for example, CGG-->TGG, an Arg-->Stop mutation). For the database being examined, CMC frequencies are weighted by the wild-type codon usage and normalized to 1 to give the observed frequency of each cSNP class. In effect, this reduces a cSNP database to a mutation spectrum that can be quantitatively contrasted against spectra similarly calculated from other mutation repositories. As a result, differences in mutation preferences between databases can be defined and measured.
    1. HGMD codon mutation class matrix - calculated from the Human Genome Mutation Database (HGMD) on 5/24/2002. This mutation spectrum is skewed because only alleles known to cause disease are analyzed.
    2. dbSNP codon mutation class matrix - calculated from human dbSNP cSNPs on 10/13/2002. This spectrum represents mostly neutral and nonconservative mutation given that dbSNP is filled mostly by cSNPs found in low-throughput genotyping experiments.
    3. CGAP-GAI codon mutation class matrix - calculated from mouse CGAP-GAI cSNPs on 4/17/2002. This gives a mutation spectrum comparable to that of dbSNP, but for a mouse model. dbSNP and the CGAP data are highly similar in mutation preferences, which supports the continued use of mice as a model system for asking questions about mutation processes and their impact on phenotype.
    4. TSC codon mutation class matrix - calculated from human SNP Consortium cSNPs, release 10, 4/15/2002. The SNP Consortium is unique in that it sequences a single panel of 24 ethnically diverse individuals for mutations. The other databases examined do not employ a consistent cohort.
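
    The weighting and normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: the function names (all_cmcs, cmc_spectrum), the input format, and the toy codon usage values are hypothetical, and "weighted by the wild-type codon usage" is assumed here to mean dividing each class's raw count by the relative usage of its wild-type codon.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def all_cmcs():
    """Enumerate every single-base codon change as (wild_type, mutant)."""
    cmcs = []
    for codon in map("".join, product(BASES, repeat=3)):
        for pos in range(3):
            for base in BASES:
                if base != codon[pos]:
                    cmcs.append((codon, codon[:pos] + base + codon[pos + 1:]))
    return cmcs

def cmc_spectrum(observed, codon_usage):
    """Turn observed (wild_type, mutant) codon pairs into a normalized spectrum.

    Assumption: each raw count is divided by the relative usage of its
    wild-type codon, then all weighted values are scaled to sum to 1.
    """
    counts = Counter(observed)
    weighted = {cmc: n / codon_usage[cmc[0]] for cmc, n in counts.items()}
    total = sum(weighted.values())
    return {cmc: w / total for cmc, w in weighted.items()}

# Toy data (made up for illustration only).
observed = [("CGG", "TGG")] * 3 + [("ACG", "ATG")]
usage = {"CGG": 0.01, "ACG": 0.005}
spectrum = cmc_spectrum(observed, usage)
```

    Enumerating the classes this way gives 64 codons x 9 single-base changes = 576 CMCs, matching the count quoted in the abstract. Because every spectrum sums to 1, spectra computed from different databases can be compared entry by entry.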
       
     
    Database-specific trinucleotide mutation class matrices for the HGMD resource
    Just as cSNPs can be categorized according to their coding frame, mutations can also be sorted by the two noncoding frames, which gives the frequency of codon-bridging trinucleotide contexts in a database.
    1. HGMD trinucleotide mutation class 1 matrix - calculated from human HGMD point mutations referenced in the noncoding frame viewed by a +1 base frameshift.
    2. HGMD trinucleotide mutation class 2 matrix - calculated from human HGMD point mutations referenced in the noncoding frame viewed by a +2 base frameshift.
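
    As a rough illustration of how one point mutation is read in the coding frame and in the two shifted frames, the sketch below extracts the wild-type and mutant trinucleotides for a given frameshift. The function name trinucleotide_class, the 0-based position convention, and the example sequence are assumptions made for illustration; they are not taken from the paper.

```python
def trinucleotide_class(seq, pos, mutant_base, shift=0):
    """Return (wild_type, mutant) trinucleotides containing position `pos`.

    seq         -- coding sequence starting at the first base of a codon
    pos         -- 0-based index of the mutated base (hypothetical convention)
    mutant_base -- the substituted base
    shift       -- 0 for the coding frame, 1 or 2 for the shifted frames
    Returns None when the shifted triplet would fall off either end of seq.
    """
    if pos < shift:
        return None
    # Start of the triplet that covers `pos` when triplets begin at `shift`.
    start = shift + 3 * ((pos - shift) // 3)
    if start + 3 > len(seq):
        return None
    wild = seq[start:start + 3]
    offset = pos - start
    mutant = wild[:offset] + mutant_base + wild[offset + 1:]
    return wild, mutant

# Example: a C-->T change at the middle base of the ACG codon.
seq = "ATGACGGCA"
coding = trinucleotide_class(seq, 4, "T", shift=0)   # ('ACG', 'ATG')
plus_one = trinucleotide_class(seq, 4, "T", shift=1) # ('CGG', 'TGG')
plus_two = trinucleotide_class(seq, 4, "T", shift=2) # ('GAC', 'GAT')
```

    The same mutation therefore falls into a different class in each frame, and the shifted frames expose contexts that span two adjacent codons.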

     

    Useful public resources often used to locate known cSNPs
    Human Genome Mutation Database (HGMD)
    dbSNP at NCBI
    Cancer Genome Anatomy Project Genome Annotation Initiative (CGAP-GAI)
    The SNP Consortium database (TSC)