We propose to continue (A) the development of mathematical, statistical, and computer methods for the analysis of DNA, RNA, and protein sequences and (B) the application of these methods. The comparison of two or more informational sequences is central to many problems in molecular biology. (1) Finding consensus patterns that define genetic control regions or that determine structure or function is an important example of sequence comparison. An algorithm already developed by my group will be developed further and applied to several new data sets, such as Pol II promoters and RNA splice signals. Careful data analyses should suggest new modifications to the method. New and nontrivial insights into promoter patterns, for example, could result from an unbiased, rigorous analysis with calculated significance levels. (2) The secondary structure of 5S, 16S, and 23S rRNA has been inferred by the phylogenetic method. Consensus and probability results will be developed to solve this problem in a rigorous way. Again, new information about secondary structure could result. (3) T1 catalogs are available for 16S rRNA from many organisms. A careful analysis, based on patterns and the statistical significance of the patterns found, will be made. This will constitute a new and entirely unbiased study of divisions such as archaebacteria, eukaryotes, and eubacteria. (4) Important results have recently been established for the exact (extreme value) distribution of long exact matches between random sequences. These distributions are fundamental to pattern recognition in general and allow statistical assessment of the patterns found. The distributions will be extended to cover long matches in which mismatches and insertions/deletions are allowed.
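As an illustration of the quantity discussed in aim (4), the following minimal Python sketch computes the length of the longest exact match shared by two sequences and compares it with a standard first-order centering value, log base 1/p of (n·m), for independent random sequences of lengths n and m whose letters match with probability p. The function names (`longest_exact_match`, `match_length_scale`), the uniform four-letter alphabet, and the simple dynamic-programming approach are assumptions made for illustration; this is not the group's algorithm, and it does not reproduce the exact extreme value distribution referred to above.

```python
import math
import random


def longest_exact_match(a: str, b: str) -> int:
    """Length of the longest contiguous run of letters common to a and b."""
    best = 0
    # prev[j] = length of the exact match ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended one position earlier in both sequences.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def match_length_scale(n: int, m: int, p: float) -> float:
    """First-order centering value log_{1/p}(n*m) for the longest exact match.

    Assumption: independent letters with per-position match probability p.
    """
    return math.log(n * m) / math.log(1.0 / p)


if __name__ == "__main__":
    # Hypothetical simulation: two random DNA sequences, uniform over A, C, G, T (p = 1/4).
    rng = random.Random(0)
    n = m = 500
    x = "".join(rng.choice("ACGT") for _ in range(n))
    y = "".join(rng.choice("ACGT") for _ in range(m))
    print("observed longest exact match:", longest_exact_match(x, y))
    print("first-order scale log_{1/p}(nm):", round(match_length_scale(n, m, 0.25), 2))
```

An observed match much longer than this scale would be a candidate for a statistically significant pattern; a formal significance level would require the full distribution, not only its centering value.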