We propose to continue (A) the development of mathematical, statistical and computer methods for the analysis of DNA, RNA and protein sequences and (B) the application of these methods. The comparison of two or more informational sequences is central to many problems in molecular biology. (1) Finding consensus patterns that define genetic control regions or that determine structure or function is an important example of sequence comparison. An algorithm already developed by my group will be developed further and applied to several new data sets, such as Pol II promoters and RNA splice signals. Careful data analyses should suggest new modifications to the method. New and nontrivial insights into promoter patterns, for example, could result from an unbiased, rigorous analysis with calculated significance levels. (2) The secondary structure of 5S, 16S, and 23S rRNA has been inferred by the phylogenetic method. Consensus and probability results will be developed to solve this problem in a rigorous way. Again, new information about secondary structure could result. (3) T1 catalogs are available for 16S rRNA from many organisms. A careful analysis, based on the patterns found and their statistical significance, will be made. This will constitute a new and entirely unbiased study of divisions such as archaebacteria, eukaryotes, and eubacteria. (4) Important results have recently been established for the exact (extreme value) distribution of long exact matches between random sequences. These distributions are fundamental to pattern recognition in general and allow statistical assessment of found patterns. The distributions will be extended to cover long matches in which mismatches and insertions/deletions are allowed.
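As an illustration of aim (4), the sketch below computes the longest exact match between two sequences and compares it with the leading-order centering term log_{1/p}(nm) known for independent random sequences of lengths n and m, where p is the probability that two random letters match. It is only a minimal sketch under stated assumptions: the function names, the uniform four-letter model (p = 0.25), and the simulation sizes are illustrative choices and are not taken from the proposal, and the code is not the group's algorithm.

```python
import math
import random


def longest_exact_match(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b.

    Plain dynamic programming over match-run lengths, O(len(a) * len(b)).
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                # Extend the run of matches ending at the previous positions.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best = cur[j]
        prev = cur
    return best


def match_length_scale(n: int, m: int, p: float) -> float:
    """Leading-order term log_{1/p}(n * m) for the longest exact match."""
    return math.log(n * m) / math.log(1.0 / p)


def random_dna(length: int, rng: random.Random) -> str:
    """Sequence of independent, uniformly distributed letters (assumed model)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n = m = 200
    observed = [
        longest_exact_match(random_dna(n, rng), random_dna(m, rng))
        for _ in range(10)
    ]
    print("mean observed longest match:", sum(observed) / len(observed))
    print("log_{1/p}(nm) with p = 0.25:", match_length_scale(n, m, 0.25))
```

An observed match much longer than this scale would be a candidate for a statistically significant pattern. A real assessment would use the full extreme value distribution and a letter model fitted to the data, neither of which this sketch attempts.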

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM036230-09
Application #
2178222
Study Section
Genome Study Section (GNM)
Project Start
1986-01-01
Project End
1994-12-31
Budget Start
1994-01-01
Budget End
1994-12-31
Support Year
9
Fiscal Year
1994
Total Cost
Indirect Cost
Name
University of Southern California
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
041544081
City
Los Angeles
State
CA
Country
United States
Zip Code
90089
Kruglyak, S; Tang, H (2001) A new estimator of significance of correlation in time series data. J Comput Biol 8:463-70
Kruglyak, S; Tang, H (2000) Regulation of adjacent yeast genes. Trends Genet 16:109-11
Heyer, L J; Kruglyak, S; Yooseph, S (1999) Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9:1106-15
Lee, J K; Dancik, V; Waterman, M S (1998) Estimation for restriction sites observed by optical mapping using reversible-jump Markov Chain Monte Carlo. J Comput Biol 5:505-15
Dancik, V; Hannenhalli, S; Muthukrishnan, S (1997) Hardness of flip-cut problems from optical mapping. J Comput Biol 4:119-25
Komatsoulis, G A; Waterman, M S (1997) A new computational method for detection of chimeric 16S rRNA artifacts generated by PCR amplification from mixed bacterial populations. Appl Environ Microbiol 63:2338-46
Agarwala, R; Batzoglou, S; Dancik, V et al. (1997) Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. J Comput Biol 4:275-96
Arratia, R; Martin, D; Reinert, G et al. (1996) Poisson process approximation for sequence repeats, and sequencing by hybridization. J Comput Biol 3:425-63
Port, E; Sun, F; Martin, D et al. (1995) Genomic mapping by end-characterized random clones: a mathematical analysis. Genomics 26:84-100
Sun, F; Arnheim, N; Waterman, M S (1995) Whole genome amplification of single cells: mathematical analysis of PEP and tagged PCR. Nucleic Acids Res 23:3034-40
