The Human Genome Project is generating a huge stream of DNA sequence data for which computational analysis tools are needed. Human DNA or mRNA sequences coding for expressed genes will be of biomedical interest for genetic diseases or therapeutic polypeptides. An examination of mRNA sequences in GenBank has revealed that calculated mRNA folding is more stable than expected by chance: free energies obtained by energy minimization are more negative for native mRNA sequences than for randomized mRNA sequences of the same composition. This suggests a bias in codon choice that favors mRNA structures with greater folding stability. If codon choice facilitates mRNA folding by base pairing, then there should exist a correlation between codon and reverse-complement codon frequencies. When codons are graphically paired to their reverse-complement codons, the twenty amino acids group into three independent families. These three amino acid (aa) families each possess charged, polar, and non-polar members. Statistical runs tests of aa from one family support the hypothesis that the graph theory representation has biological significance. These results will be applied to analyze yeast transcriptome SAGE data, and eventually human SAGE data when available. This proposed work seeks to 1) develop a classification of proteins based on these three aa families, 2) determine if mRNA folding stability is greater than expected due to the decomposition of the twenty amino acids into the three families, 3) correlate yeast transcriptome expression levels with folding stability bias, and 4) determine if a computational neural network can backtranslate an aa sequence into a DNA sequence based on correlations between codon and reverse-complement codon frequencies.
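The codon-to-reverse-complement pairing described above can be illustrated with a minimal sketch. It assumes the standard genetic code and links two amino acids whenever any codon of one is the reverse complement of a codon of the other; stop codons are skipped. The function names (`reverse_complement`, `pairing_graph`, `connected_components`) are illustrative and not taken from the proposal. The abstract does not say which codon pairs are retained when forming the three families (for example, whether pairs are filtered by codon frequency), so this unfiltered graph is not expected to reproduce that grouping as written.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(codon):
    """Return the reverse complement of a DNA codon, e.g. ATG -> CAT."""
    return codon.translate(COMPLEMENT)[::-1]


def pairing_graph(table=CODON_TABLE):
    """Link amino acids whose codons are reverse complements of each other.

    Assumption: every codon pair counts equally; stop codons are ignored.
    """
    graph = {aa: set() for aa in set(table.values()) if aa != "*"}
    for codon, aa in table.items():
        partner = table[reverse_complement(codon)]
        if aa == "*" or partner == "*":
            continue
        graph[aa].add(partner)
        graph[partner].add(aa)
    return graph


def connected_components(graph):
    """Group amino acids into families by depth-first search over the graph."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, family = [start], set()
        while stack:
            node = stack.pop()
            if node in family:
                continue
            family.add(node)
            stack.extend(graph[node] - family)
        seen |= family
        components.append(sorted(family))
    return components


if __name__ == "__main__":
    graph = pairing_graph()
    for aa in sorted(graph):
        print(aa, "pairs with", "".join(sorted(graph[aa])))
    print("families:", connected_components(graph))
```

Restricting the edges, for instance to the most frequently used codon of each amino acid in a given organism, would be one way to explore how the families depend on codon usage; that filter is an assumption and would need to be checked against the actual method.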
This proposed work will assist the understanding of human gene expression and codon bias, and will be of biomedical interest for 1) design of primers for reverse transcriptase PCR from mRNA, 2) antisense gene therapy concerning mRNA folding stability, 3) degenerate primer PCR cloning from backtranslated amino acid sequences, and 4) computational tools to analyze and characterize proteins and mRNA sequences in GenBank and transcriptome SAGE data.