The unprecedented accumulation of DNA and protein sequence data (including the complete physical map of E. coli and the first complete nucleotide sequence of a eukaryotic chromosome, yeast chromosome III) poses challenges and opportunities in terms of organization and analysis. Our proposed research will focus on mathematical, statistical, computational, and informatics problems in this context. The main topics are (i) statistical theory and applications of score-based sequence analysis; (ii) theory and applications of rho-scan statistics for heterogeneity assessments within and among sequences; (iii) development of computer programs for the statistical analysis of protein and nucleotide sequences (SAPS and SANS); (iv) studies of oligonucleotide compositional biases, including characterizations of rare and frequent oligonucleotides; (v) definitions and applications of distance measures and orderings among DNA sequences. Score-based sequence analysis methods are in wide use, both for single sequences (e.g., hydropathy plots) and for sequence comparisons (e.g., BLAST). Relevant approximations to probability distributions will be derived for sums of high-scoring segments and for the maximal matching alignment score in the case that scores are random vectors (representing, for example, the charge, hydrophobicity, and steric attributes of an amino acid simultaneously). Computer algorithms will be devised to calculate approximate probabilities for given sets of parameters. Rho-scans assess anomalies in the distribution of markers along a line (e.g., restriction sites, special oligonucleotides, nucleosome placements). The theory will be developed to accommodate deviations from a specified theoretical distribution and to allow data comparisons among several sequences.
In addition to the above methods, the SAPS and SANS programs will implement a large number of other statistics (e.g., compositional evaluations with multivariate quantile distributions; counts and spacings of close repeats and close dyads) and should help with the design of experiments. Methods for evaluating oligonucleotide compositional biases, together with distance measures based on oligonucleotide composition, are proposed to assess differences and similarities among and within sequences, with particular relevance to functional and structural roles and to phylogenetic reconstructions. Intensive, detailed studies of large genomic sequences will be conducted for comparative purposes and to identify special regions (origins of replication, regulatory sequences, and structural elements).
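As an illustration of the score-based analysis of a single sequence mentioned above, the following minimal sketch finds the contiguous segment with the highest total score. It is not taken from SAPS or SANS: the function name `max_scoring_segment` and the `TOY_SCORES` table are hypothetical, the score values are invented for the example rather than drawn from any published scale, and no significance assessment is attempted.

```python
# Illustrative only: TOY_SCORES holds invented values, not a real hydropathy scale.
TOY_SCORES = {"A": 1, "L": 2, "K": -2, "D": -3}


def max_scoring_segment(seq, scores):
    """Return (best_score, start, end) for the highest-scoring contiguous
    segment of seq, with end exclusive. Returns (0, 0, 0) if no segment
    has a positive total score."""
    best, best_start, best_end = 0, 0, 0
    running, run_start = 0, 0
    for i, residue in enumerate(seq):
        # A non-positive running total cannot help a later segment, so restart here.
        if running <= 0:
            running, run_start = 0, i
        running += scores[residue]
        if running > best:
            best, best_start, best_end = running, run_start, i + 1
    return best, best_start, best_end


if __name__ == "__main__":
    seq = "KALLDKAL"
    score, start, end = max_scoring_segment(seq, TOY_SCORES)
    print(score, seq[start:end])
```

Deciding whether such a segment score is unusually high for a given sequence composition is the kind of question the proposed probability approximations address; the sketch covers only the search step.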