We propose to develop and apply mathematical, statistical, and computational methods for the analysis of nucleic acid and amino acid sequence data. The long-range goals fall into three categories. (1) Computational analysis is essential to our approaches to sequence data. Algorithms are being developed to assemble shotgun sequence data, to search for tandem repeats of length up to 32 base pairs, to find the consensus local alignment of an unknown region common to an unknown subset of sequences, to study the thermodynamic and statistical behavior of experiments that repeatedly select and amplify DNA molecules, and to weight multiple and suboptimal sequence alignment paths. (2) Physical mapping of DNA is important in genome analysis. Studies include the PEP procedure for amplifying single chromosomes, the modeling of PCR as a branching process that includes both amplification errors and efficiency less than 1, the mathematical analysis of physical mapping using end-characterized clones, and the classification of multiple solutions of the double digest problem. (3) As sequence data increase, estimating statistical significance becomes more central. We will develop methods for estimating the statistical significance of tandem repeat scores, and we will investigate Poisson distributional results for sequence alignment in certain cases where the Chen-Stein method fails, the statistical distribution of correctly inferred sequence in shotgun sequencing projects as a function of depth and accuracy, and the growth of the minimum free energy of secondary structures of a random RNA.