

Software Design & Analysis

MaskDesigner, DOCOPs, DOC_U_MENTOR:  software for microarray design and layout, DOC instrument operation, and DOC readout
ARROGANT:  an array organizing tool
IRIDESCENT:  data mining code for the analysis of experimental data, hypothesis generation and exploration
ELXR:  code that identifies exons, CpG islands, and other features in the genome
SNP:  prediction and verification
PANORAMA:  DNA sequence analysis and visualization


MaskDesigner, DOCOPs, DOC_U_MENTOR

The DOC machine operations software consists of three components: the microarray design and layout software, the DOC instrument operation software, and the DOC readout software.  The input for the microarray design software is a FASTA file containing the sequences to be analyzed.  The design software creates the probes for the specified application (resequencing, methylation analysis, gene expression, comparative genomic hybridization, etc.), determines the most efficient placement of each probe on the microarray, creates the digital masks, and writes a command file for the DOC instrument.  The DOC instrument operation software takes as inputs the masks and command file created by the design software, and controls the steps necessary for microarray fabrication and light-directed DNA synthesis.  Finally, a custom program for data extraction, termed DOC_U_MENTOR, determines the intensity values of each feature.
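The first design step described above (FASTA input, then probe generation) can be sketched in a few lines of Python. This is only an illustration and not the MaskDesigner code: the function names, the 25-base probe length, and the scheme of four probes per interrogated base that differ only at the central position are assumptions chosen to show one common way resequencing probes are tiled. Probe placement and mask generation are not covered.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def resequencing_probes(seq, probe_len=25):
    """Return (position, central_base, probe) tuples for a tiled design.

    Assumption: probe_len is odd, and each interrogated position gets four
    probes that are identical except for the base at the centre.
    """
    half = probe_len // 2
    probes = []
    for centre in range(half, len(seq) - half):
        window = seq[centre - half:centre + half + 1]
        for base in "ACGT":
            probes.append((centre, base, window[:half] + base + window[half + 1:]))
    return probes


# Toy input; the sequence is a placeholder, not real data.
records = list(read_fasta(StringIO(">wt demo\nACGTACG\nTAC\n")))
probes = resequencing_probes(records[0][1], probe_len=5)
```

With this scheme the probe that matches the reference at its centre should hybridize most strongly when the sample is wild type, which is what allows a variant base to be called from the four intensities at each position.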

An example output, in this case from a resequencing chip, being analyzed by DOC_U_MENTOR.  The interactive image data with a grid overlay is in the left window; the right windows contain information about the chip, each probe on it (including intensity and background values), and the final assembled sequence, with variants (putative SNPs) presented when the experimental sequence is aligned to an input wild-type sequence.



ARROGANT: ARRay ORGANizing Tool, a computer program for design and analysis of microarrays

ARROGANT is a computer tool developed to facilitate the identification, analysis and comparison of collections of genes or clones [35]. In its analysis mode, ARROGANT is a comprehensive tool for annotating large gene collections. It takes over at the point where data are usually presented to the researcher (cluster diagrams, values in spreadsheets) and where the researcher would otherwise be responsible for accumulating facts about the genes in order to interpret the experimental data.  In this mode, ARROGANT takes in a large collection of gene identifiers and associates them with information collected from many sources, including sequence annotations, pathways, homology, polymorphisms, artifacts, etc. In the design mode, ARROGANT assists in compiling a gene collection by querying several different databases simultaneously with keywords and their synonyms. In one integrated package, ARROGANT facilitates the design of expression and resequencing microarrays by designing primers, looking for commercially available clones and designing probes for resequencing.  The package also has a third mode of operation that eliminates sequence redundancies and duplicates from multiple gene collections. This is useful for identifying redundancies that arise when sequences or clones have different accession numbers but represent fragments of the same gene, in a manner similar to UniGene clusters. It simplifies integrating data from different array designs and formats or from various research groups.
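The third mode (collapsing accessions that represent the same gene) can be illustrated with a minimal Python sketch. It is not ARROGANT's implementation: the function name, the `cluster_of` lookup table and all identifiers below are hypothetical placeholders, and a real lookup would come from a resource such as UniGene.

```python
from collections import defaultdict


def collapse_by_cluster(collections, cluster_of):
    """Merge several accession lists into one entry per gene cluster.

    collections: dict mapping a collection name to a list of accessions.
    cluster_of:  dict mapping an accession to a cluster identifier
                 (hypothetical lookup; unknown accessions stay as singletons).
    """
    merged = defaultdict(lambda: {"accessions": set(), "sources": set()})
    for source, accessions in collections.items():
        for acc in accessions:
            cluster = cluster_of.get(acc, acc)
            merged[cluster]["accessions"].add(acc)
            merged[cluster]["sources"].add(source)
    return dict(merged)


# Placeholder identifiers, not real accession numbers.
collections = {
    "array_1": ["ACC001", "ACC002"],
    "array_2": ["ACC003", "ACC999"],
}
cluster_of = {"ACC001": "CLUSTER_A", "ACC003": "CLUSTER_A", "ACC002": "CLUSTER_B"}
merged = collapse_by_cluster(collections, cluster_of)
```

Keeping the set of source collections alongside each cluster makes it easy to see which genes are shared between array designs and which are unique to one of them.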

ARROGANT output selector indicates all the types of information collected for each gene/clone.  

With this sortable and searchable information at the user's fingertips, interpretation and hypothesis or model generation become easier for a user who would otherwise have to scan many databases serially to get the desired information.  Note that various types of experimental data can be overlaid on this information.  ARROGANT has been successfully applied to many large gene collections for microarrays, complex multigenic trait projects, polymorphism discovery projects, etc. ARROGANT is available on the World Wide Web at http://innovation.swmed.edu. For example, ARROGANT was used to analyze colon cancer data from microarrays, resulting in the identification of genes that play a role in the growth of cancer cells.  The annotation provided by ARROGANT revealed that one gene was a chemotactic factor for lymphocytes, leading to the hypothesis that fewer lymphocytes are recruited to tumor sites, allowing the tumor to grow unchecked by the immune system.



IRIDESCENT: a data mining code for the identification of implicit and direct connections among biomedical objects for the analysis of experimental data, hypothesis generation and exploration

An automated text data mining system that constructs networks of relationships among relevant “biomedical objects” to identify novel implicit relationships has been developed to facilitate knowledge discovery and hypothesis generation. The IRIDESCENT system (Implicit Relationship IDEntification by Software Construction of an Entity-based Network from Text) has been applied to biomedicine to characterize the performance of the system, to discover direct and indirect relationships, and to score those relationships for statistical significance.  This implementation defines a domain (a query) of objects that are a subset of a broad compilation of biomedical objects (e.g. genes, diseases, phenotypes, small molecules), extracts the relevant data from the MEDLINE database, and then finds new connections (relationships) from the co-occurrence of terms within sentences and abstracts. We have validated the utility of the IRIDESCENT system by applying it to a variety of datasets. IRIDESCENT was used to analyze data from Affymetrix microarrays (69 genes with reproducible differential expression) from experiments conducted in collaboration with Suzanne Fuqua (Baylor Breast Cancer SPORE in Houston).  That analysis gave insight into the possible mechanisms responsible for breast cancer by identifying from the literature the concepts that most closely bind together all the genes mentioned.
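The co-occurrence idea described above can be sketched as follows. This is a toy illustration under stated assumptions, not the IRIDESCENT code: it matches terms by naive substring search, treats each text as one unit rather than separating sentences from abstracts, and omits the statistical scoring entirely. The function names and the placeholder terms are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_network(documents, vocabulary):
    """Count, for each pair of known objects, how many texts mention both."""
    edges = defaultdict(int)
    for text in documents:
        lowered = text.lower()
        present = sorted(t for t in vocabulary if t.lower() in lowered)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return dict(edges)


def implicit_links(edges, source):
    """Map each object not directly linked to `source` to the
    intermediate objects that connect the two."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    direct = neighbours[source]
    found = defaultdict(set)
    for mid in direct:
        for far in neighbours[mid]:
            if far != source and far not in direct:
                found[far].add(mid)
    return {far: sorted(mids) for far, mids in found.items()}


# Placeholder texts and terms; they make no biological claim.
docs = [
    "geneA is associated with diseaseX.",
    "diseaseX responds to compoundY.",
]
edges = build_network(docs, ["geneA", "diseaseX", "compoundY"])
links = implicit_links(edges, "geneA")
```

Here "geneA" and "compoundY" never appear in the same text, yet they are connected through "diseaseX"; that is the kind of implicit relationship the network is meant to surface for further scoring.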

IRIDESCENT was used to process the data from 29 lung cancer cell lines analyzed using Affymetrix U95a GeneChips by John Minna and Luc Girard.  That experiment set consisted of a variety of lung cancer cell lines (SCLC, adenocarcinoma and squamous cell carcinoma).  The input to IRIDESCENT consisted of ~30 names of genes that were up-regulated or down-regulated for each type relative to normal human bronchial epithelium.  The code identified direct links (from one gene to another on the list) and indirect links (through an intermediate term), found across all MEDLINE references for each term, that bind the lists together.  Inspection of the list is revealing about the commonalities of the data.  Foremost, the term “carcinoma” linked all the members of the various lists of genes, indicating that in the literature cancer is the common context in which these genes are most often found. This gives confidence that the code found good linkages when given no information other than gene names (and not that the experiment involved cancer).  When the 30 most up-regulated genes from all SCLC lines are analyzed for set commonalities, the objects "Nuclear Proteins" and "DNA binding proteins" are most significant, suggesting that a good portion of the up-regulated genes encode products that interact with DNA.  The code identified other genes that may play a role in the development or pathology of SCLC by virtue of their association with up-regulated genes. From the 34 most down-regulated genes, we see significant scores for two growth factors, integrin and Lung Cancer.  Genes in this list such as CAV1, AIM1 and TGFBR2 are all involved in tumor suppression through the inhibition of growth factors.  Their down-regulation enables the perpetuation of growth signals to the tumor. Integrins are involved in cell adhesion with laminins, and their loss contributes to metastasis. Seventeen of these genes are already known to be involved in lung cancer.
Almost all the notable genes associated with this list are keratins. Keratins are predominantly associated with epithelial cells and, like integrins, play a role in cell-cell adhesion. Loss of keratins may allow the cell to "break out" of its epithelial matrix and become invasive.   



ELXR: a code that identifies exons, CpG islands and other features in the genome and provides primers for their exploitation.

Resequencing the protein coding regions of genes is a common technique used for detecting single nucleotide polymorphisms (SNPs) or other sequence-based mutations.  Typically, this detection process requires the elucidation of gene structure, design of polymerase chain reaction (PCR) primers, and subsequent sequencing of the PCR product.  We have developed a web-based computer program called the Exon Locator and Extractor for Resequencing (ELXR) to facilitate this process.  This program combines multiple tools to make the analysis of discrete exons a more robust and streamlined process by 1) retrieving query mRNA sequences from public databases, 2) retrieving the corresponding genomic sequence encapsulating the query mRNA, 3) determining intron/exon boundaries, and 4) designing PCR/sequencing primer pairs flanking each exon.  This process takes a fraction of the time of manual methods and follows a consistent method.  Using ELXR, we have pre-computed primer sets for all exons identified from the entire NCBI human mRNA reference sequence (RefSeq) public database.  These results have been compiled into a queryable system called ELXRdb, which may be searched by keyword, gene name or RefSeq accession number.  ELXR and ELXRdb, along with documentation, are WWW services located at http://exon.swmed.edu/index.html and http://morpheus.swmed.edu/elxrdb_query.html.  This key code provides the data necessary to identify CpG islands in putative promoter regions for methylation analysis, and primer sets that span exon boundaries for alternative splice determination and for resequencing.
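Step 4 above (primer pairs flanking each exon) starts from the intronic sequence on either side of the exon. The Python sketch below only extracts those candidate regions; it is not ELXR's algorithm, and the function names, the 0-based half-open exon coordinates and the default flank size are assumptions. Actual primer selection (melting temperature, specificity checks and so on) is left out.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def flanking_regions(genomic, exons, flank=100):
    """Return the sequence on each side of every exon.

    exons: list of (start, end) pairs, 0-based and half-open (assumption).
    The downstream region is reverse-complemented so that a reverse primer
    can be read from it directly in the 5' to 3' direction.
    """
    regions = []
    for start, end in exons:
        left_start = max(0, start - flank)
        right_end = min(len(genomic), end + flank)
        regions.append({
            "exon": (start, end),
            "forward_region": genomic[left_start:start],
            "reverse_region": reverse_complement(genomic[end:right_end]),
        })
    return regions


# Toy sequence with one "exon" at positions 4..12; placeholder data only.
regions = flanking_regions("TTGACCCCGGGGATCA", [(4, 12)], flank=3)
```

Placing both primers in the flanking sequence means the PCR product covers the whole exon, including the splice junctions, which is what resequencing of coding regions requires.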



SNP Prediction and Verification: Development of the informatics tool SNIDE

The focus of recent public and private polymorphism research has been on identifying single nucleotide polymorphisms (SNPs).  There is tremendous potential for applying these SNPs to the analysis of genetic alterations taking place in the pathogenesis of human cancer.  We developed an algorithm, called SNIDE (SNp IDEntification), based on a statistical analysis of a large collection of clinical mutation data (HGMD database). It computes relative mutation rate expectation values for each codon triplet in any given gene in order to identify the codons that are likely to have a significant impact on gene activity when mutated.  The most predictive class of high-impact mutation (Arg to Stop) occurs 2,500 times more frequently than the least predictive and 100 times more frequently than the median.  As part of other funded research we are applying this algorithm to a large number of genes and resequencing several promising regions for verification using gel-based sequencing.  Initial results on 132 dilated cardiomyopathy and 60 cancer patients found 21 new SNP sites (3 previously known), for a total of 169 variants.  A high-impact causative SNP was discovered in the bradykinin B2 receptor (BDKRB2, a G-protein coupled receptor) gene. We intend to compile a list of ranked SNPs from the SNIDE computation for candidate cancer gene exons and produce a DOC microarray to interrogate them simultaneously on cancer samples for high-impact causative variations.
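To make the Arg-to-Stop class concrete, the sketch below scans a coding sequence for codons that become a stop codon after a single base substitution. It is only an illustration of the codon-level bookkeeping and does not reproduce SNIDE: the HGMD-derived mutation rate weights are not included, and the function name and output layout are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_gain_sites(cds):
    """List single-base substitutions that turn a sense codon into a stop.

    Returns tuples of (codon_index, codon, base_position, new_base, mutant),
    with 0-based indices. Trailing bases that do not fill a codon are ignored.
    """
    cds = cds.upper()
    hits = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in STOP_CODONS:
            continue
        for pos in range(3):
            for base in "ACGT":
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if mutant in STOP_CODONS:
                    hits.append((i // 3, codon, i + pos, base, mutant))
    return hits


# Toy coding sequence: ATG (Met), CGA (Arg), GCT (Ala).
hits = stop_gain_sites("ATGCGAGCT")
```

In this toy sequence the only hit is the arginine codon CGA, where a C to T change at its first base gives the stop codon TGA. A ranking step like the one described above would then weight each such site by its expected mutation rate.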



DNA Sequence Analysis and Visualization: Development of the informatics tool PANORAMA

Researchers are working to acquire a complete picture of the genetic abnormalities that occur in the pathogenesis of human cancers such as lung cancer.  As the exponential growth of DNA sequence information in databases continues, the task of converting this deposited information into knowledge becomes more dependent on integrative sequence analysis and visualization tools.  PANORAMA was developed through a close collaboration between the bioinformatics group in the Garner Lab and the cancer group in the Minna Lab.

PANORAMA is an Internet-accessible software package that performs a variety of informatics analyses on a given DNA sequence and returns a visual, interactive Java representation of the results. It compares an input DNA sequence with EST and non-EST GenBank entries (using BLAST; ESTs are in blue, non-ESTs are in red), predicts exons, cDNA and peptide sequences (using GenScan, in orange), scans for CpG islands (in peach), predicts polymorphic markers, identifies human repetitive regions and simple sequences (using BLAST, in green), and predicts potential DNA structures including triplexes, tetraplexes, and Z-DNA. Its design is modular, so that further sequence analysis tools can be integrated.
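The CpG island scan mentioned above can be illustrated with a simple sliding-window sketch. The criteria PANORAMA actually uses are not stated here, so the thresholds below (a 200-base window, GC fraction above 0.5, and an observed/expected CpG ratio above 0.6) are an assumption based on the widely used Gardiner-Garden and Frommer definition, and the function names are hypothetical. Merging overlapping windows into islands is omitted.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    length = len(window)
    c, g = window.count("C"), window.count("G")
    cpg = window.count("CG")
    gc_fraction = (c + g) / length if length else 0.0
    expected = c * g / length if length else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return start positions of windows that satisfy both thresholds."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc, ratio = cpg_stats(seq[start:start + window])
        if gc > min_gc and ratio > min_ratio:
            hits.append(start)
    return hits


# Synthetic sequence: 200 bases of AT repeats followed by 200 of CG repeats.
hits = cpg_windows("AT" * 100 + "CG" * 100, window=200, step=100)
```

Only the window that lies entirely in the CG-rich half passes both tests; the window straddling the boundary has a GC fraction of exactly 0.5 and is rejected by the strict comparison.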

Shown is the PANORAMA output for a 30 kb portion of the 3p21 region where candidate tumor suppressor genes such as 101F6, NPRL2, RASSF1A and FUS1 were identified.

The utility of PANORAMA is demonstrated in the analysis of an initial 750 kb of human genomic DNA from chromosome region 3p21.3, a region of potential tumor suppressor genes (TSGs) involved in lung, breast and other forms of cancer. PANORAMA aided in the discovery of genes and of alternate splice forms of known exons, the demarcation of intron-exon boundaries, and the identification of promoter regions and polymorphisms, all of which contributed to a better understanding of the region and to the design of DOC chips for its further analysis. PANORAMA is available on the World Wide Web at http://innovation.swmed.edu.  We will modify PANORAMA slightly to provide a visualization engine for the combined expression, methylation, alternative splicing and other available annotation data.  This should assist in our interpretation of the various data emerging from the DOC chip experiments.
