The unprecedented accumulation of DNA and protein sequence data (including the complete physical map of E. coli and the first complete nucleotide sequence of a eukaryotic chromosome, yeast chromosome III) poses challenges and opportunities in terms of organization and analysis. Our proposed research will focus on mathematical, statistical, computational, and informatics problems in this context. The main topics are (i) statistical theory and applications of score-based sequence analysis; (ii) theory and applications of rho-scan statistics for heterogeneity assessments within and among sequences; (iii) development of computer programs for the statistical analysis of protein and nucleotide sequences (SAPS and SANS); (iv) studies of oligonucleotide compositional biases, including characterizations of rare and frequent oligonucleotides; (v) definitions and applications of distance measures and orderings among DNA sequences. Score-based sequence analysis methods are in wide use, both for single sequences (e.g., hydropathy plots) and for sequence comparisons (e.g., BLAST). Relevant approximations to probability distributions will be derived for sums of high-scoring segments and for the maximal matching alignment score in the case that scores are random vectors (representing, for example, the charge, hydrophobicity, and steric attributes of an amino acid simultaneously). Computer algorithms will be devised to calculate approximate probabilities for given sets of parameters. Rho-scans assess anomalies in the distribution of markers along a line (e.g., restriction sites, special oligonucleotides, nucleosome placements). The theory will be developed to accommodate deviations from a specified theoretical distribution and to support data comparisons among several sequences.
The SAPS and SANS programs will implement, in addition to the above methods, a large number of other statistics (e.g., compositional evaluations with multivariate quantile distributions; counts and spacings of close repeats and close dyads) and should help with the design of experiments. Methods for evaluating oligonucleotide compositional biases, and distance measures based on oligonucleotide composition, are proposed to assess differences and similarities among and within sequences, with particular relevance to functional and structural roles and to phylogenetic reconstructions. Intensive, detailed studies of large genomic sequences will be conducted for comparative purposes and to identify special regions (origins of replication, regulatory sequences, and structural elements).
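
As a rough illustration of what a distance measure based on oligonucleotide composition can look like, the sketch below compares two DNA sequences by their dinucleotide odds ratios (observed dinucleotide frequency divided by the product of the component base frequencies). The function names, the restriction to dinucleotides, and the use of a mean absolute difference are assumptions made for illustration; they are not taken from the proposal, and the sketch omits refinements such as combining a sequence with its reverse complement.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_odds_ratios(seq):
    """Return {xy: f_xy / (f_x * f_y)} for dinucleotides whose expectation is nonzero.

    Hypothetical helper: frequencies come from a single strand only.
    """
    seq = seq.upper()
    # Count unambiguous bases and adjacent unambiguous base pairs.
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")

    ratios = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        if expected > 0:  # skip pairs involving a base that never occurs
            ratios[x + y] = (di[x + y] / n_di) / expected
    return ratios


def composition_distance(seq_a, seq_b):
    """Mean absolute difference of odds ratios over dinucleotides defined in both."""
    ra = dinucleotide_odds_ratios(seq_a)
    rb = dinucleotide_odds_ratios(seq_b)
    shared = [k for k in ra if k in rb]
    if not shared:
        raise ValueError("no dinucleotides are defined in both sequences")
    return sum(abs(ra[k] - rb[k]) for k in shared) / len(shared)
```

A call such as `composition_distance(genome_a, genome_b)` returns 0 for identical sequences and grows as the two dinucleotide profiles diverge; ratios well above or below 1 in `dinucleotide_odds_ratios` flag over- or under-represented dinucleotides.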

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG000335-08
Application #
2208744
Study Section
Genome Study Section (GNM)
Project Start
1988-08-01
Project End
1996-07-31
Budget Start
1995-08-01
Budget End
1996-07-31
Support Year
8
Fiscal Year
1995
Total Cost
Indirect Cost
Name
Stanford University
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
800771545
City
Stanford
State
CA
Country
United States
Zip Code
94305
Karlin, Samuel; Theriot, Julie; Mrazek, Jan (2004) Comparative analysis of gene expression among low G+C gram-positive genomes. Proc Natl Acad Sci U S A 101:6182-7
Karlin, Samuel; Barnett, Melanie J; Campbell, Allan M et al. (2003) Predicting gene expression levels from codon biases in alpha-proteobacterial genomes. Proc Natl Acad Sci U S A 100:7313-8
Mrazek, Jan; Gaynon, Lisa H; Karlin, Samuel (2002) Frequent oligonucleotide motifs in genomes of three streptococci. Nucleic Acids Res 30:4216-21
Karlin, Samuel; Chen, Chingfer; Gentles, Andrew J et al. (2002) Associations between human disease genes and overlapping gene groups and multiple amino acid runs. Proc Natl Acad Sci U S A 99:17008-13
Ma, Jiong; Campbell, Allan; Karlin, Samuel (2002) Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J Bacteriol 184:5733-45
Karlin, Samuel; Brocchieri, Luciano; Bergman, Aviv et al. (2002) Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci U S A 99:333-8
Chen, Chingfer; Gentles, Andrew J; Jurka, Jerzy et al. (2002) Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22. Proc Natl Acad Sci U S A 99:2930-5
Karlin, Samuel; Brocchieri, Luciano; Trent, Jonathan et al. (2002) Heterogeneity of genome and proteome content in bacteria, archaea, and eukaryotes. Theor Popul Biol 61:367-90
Karlin, S (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol 9:335-43
Brocchieri, L (2001) Phylogenetic inferences from molecular sequences: review and critique. Theor Popul Biol 59:27-40

Showing the most recent 10 out of 74 publications