We propose to develop and apply mathematical, statistical, and computational methods for the analysis of nucleic acid and amino acid sequence data. The long-range goals fall into three categories. (1) Computational analysis is essential to our approaches to sequence data. Algorithms are being developed to assemble shotgun sequences, to search for tandem repeats of length up to 32 basepairs, to find the consensus local alignment of an unknown region common to an unknown subset of sequences, to study the thermodynamic/statistical behavior of experiments that repeatedly select and amplify DNA molecules, and to weight multiple and suboptimal sequence alignment paths. (2) Physical mapping of DNA is important in genome analysis. Studies include the PEP procedure to amplify single chromosomes, PCR as a branching process including both amplification errors and efficiency less than 1, the mathematical analysis of physical mapping using end-characterized clones, and classification of multiple solutions of the double digest problem. (3) As sequence data increase, estimating statistical significance becomes more central. We will develop methods for estimating the statistical significance of scores of tandem repeats, and will investigate Poisson distributional results for sequence alignment in certain cases where the Chen-Stein method fails, the statistical distribution of correctly inferred sequence in shotgun sequencing projects as a function of depth and accuracy, and the growth of the minimum free energy of secondary structures of a random RNA.
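As an illustration of the phrase "PCR as a branching process" with "efficiency less than 1", the following is a minimal sketch and not the project's own model. It assumes that in each cycle every molecule is copied independently with probability `efficiency`, so the expected count after n cycles is n0 * (1 + efficiency)^n. Amplification errors are deliberately left out, and the function names `expected_copies` and `simulate_pcr` are hypothetical.

```python
import random


def expected_copies(n0, efficiency, cycles):
    """Mean molecule count after `cycles` rounds: n0 * (1 + efficiency)**cycles."""
    return n0 * (1.0 + efficiency) ** cycles


def simulate_pcr(n0, efficiency, cycles, rng):
    """Simulate one PCR run as a simple branching process.

    Assumption (not from the abstract): each molecule present at the start
    of a cycle yields one extra copy with probability `efficiency`,
    independently of all other molecules. Copying errors are not modeled.
    """
    count = n0
    for _ in range(cycles):
        # Count how many of the current molecules were successfully copied.
        new_copies = sum(1 for _ in range(count) if rng.random() < efficiency)
        count += new_copies
    return count


if __name__ == "__main__":
    rng = random.Random(1)
    runs = [simulate_pcr(10, 0.8, 8, rng) for _ in range(200)]
    print("simulated mean:", sum(runs) / len(runs))
    print("expected mean: ", expected_copies(10, 0.8, 8))
```

With efficiency equal to 1 the simulation reduces to exact doubling each cycle; with efficiency below 1 the final count varies from run to run, which is the randomness a branching-process analysis is meant to quantify.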

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM036230-13
Application #
2634657
Study Section
Genome Study Section (GNM)
Project Start
1986-01-01
Project End
1999-12-31
Budget Start
1998-01-01
Budget End
1999-12-31
Support Year
13
Fiscal Year
1998
Total Cost
Indirect Cost
Name
University of Southern California
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
041544081
City
Los Angeles
State
CA
Country
United States
Zip Code
90089
Kruglyak, S; Tang, H (2001) A new estimator of significance of correlation in time series data. J Comput Biol 8:463-70
Kruglyak, S; Tang, H (2000) Regulation of adjacent yeast genes. Trends Genet 16:109-11
Heyer, L J; Kruglyak, S; Yooseph, S (1999) Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9:1106-15
Lee, J K; Dancik, V; Waterman, M S (1998) Estimation for restriction sites observed by optical mapping using reversible-jump Markov Chain Monte Carlo. J Comput Biol 5:505-15
Dancik, V; Hannenhalli, S; Muthukrishnan, S (1997) Hardness of flip-cut problems from optical mapping. J Comput Biol 4:119-25
Komatsoulis, G A; Waterman, M S (1997) A new computational method for detection of chimeric 16S rRNA artifacts generated by PCR amplification from mixed bacterial populations. Appl Environ Microbiol 63:2338-46
Agarwala, R; Batzoglou, S; Dancik, V et al. (1997) Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. J Comput Biol 4:275-96
Arratia, R; Martin, D; Reinert, G et al. (1996) Poisson process approximation for sequence repeats, and sequencing by hybridization. J Comput Biol 3:425-63
Port, E; Sun, F; Martin, D et al. (1995) Genomic mapping by end-characterized random clones: a mathematical analysis. Genomics 26:84-100
Sun, F; Arnheim, N; Waterman, M S (1995) Whole genome amplification of single cells: mathematical analysis of PEP and tagged PCR. Nucleic Acids Res 23:3034-40

Showing the most recent 10 out of 23 publications