Molecular Sequence Data

Karlin, Samuel

Abstract

The unprecedented accumulation of DNA and protein sequence data (including the complete physical map of E. coli and the first complete nucleotide sequence of a eukaryotic chromosome, yeast chromosome III) poses challenges and opportunities in terms of organization and analysis. Our proposed research will focus on mathematical, statistical, computational, and informatics problems in this context. The main topics are (i) statistical theory and applications of score-based sequence analysis; (ii) theory and applications of rho-scan statistics for heterogeneity assessments within and among sequences; (iii) development of computer programs for the statistical analysis of protein and nucleotide sequences (SAPS and SANS); (iv) studies on oligonucleotide compositional biases, including characterizations of rare and frequent oligonucleotides; (v) definitions and applications of distance measures and orderings among DNA sequences. Score-based sequence analysis methods are in wide use, both with respect to single sequences (e.g., hydropathy plots) and with respect to sequence comparisons (e.g., BLAST). Relevant probability distribution approximations will be derived for sums of high-scoring segments and the maximal matching alignment score in the case that scores are random vectors (representing, for example, simultaneously the charge, hydrophobicity, and steric attributes of an amino acid). Computer algorithms will be devised to calculate approximate probabilities for given sets of parameters. Rho-scans assess anomalies in the distribution of markers along a line (e.g., restriction sites, special oligonucleotides, nucleosome placements). The theory will be developed to accommodate deviations from a specified theoretical distribution and for data comparisons among several sequences.
These programs (SAPS and SANS) will implement, in addition to the above methods, a large number of other statistics (e.g., compositional evaluations with multivariate quantile distributions; counts and spacings of close repeats and close dyads) and should help with the design of experiments. Methods for evaluating oligonucleotide compositional biases and distance measures based on oligonucleotide composition are proposed to assess differences and similarities among and within sequences, with particular relevance to functional/structural roles and phylogenetic reconstructions. Intensive, detailed studies on large genomic sequences will be conducted for comparative purposes and to identify special regions (origins of replication, regulatory sequences, and structural elements).
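
To make the idea of a distance measure based on oligonucleotide composition concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of the proposal or of SAPS/SANS: it assumes the oligonucleotides are dinucleotides, measures bias as the odds ratio of the observed dinucleotide frequency to the product of its two base frequencies, and takes the distance to be the mean absolute difference of those ratios. It reads a single strand only. The names dinucleotide_odds and composition_distance are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_odds(seq):
    """Return the odds ratio f_xy / (f_x * f_y) for all 16 dinucleotides.

    Assumption for illustration: single-strand counts, overlapping
    dinucleotides, and any position holding a non-ACGT symbol is skipped.
    """
    seq = seq.upper()
    mono = Counter(ch for ch in seq if ch in BASES)
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_pairs = sum(pairs.values())
    if n_pairs == 0:
        raise ValueError("sequence must contain at least one ACGT dinucleotide")

    odds = {}
    for x, y in product(BASES, repeat=2):
        fx = mono[x] / n_mono
        fy = mono[y] / n_mono
        fxy = pairs[x + y] / n_pairs
        # A base that never occurs gives an undefined ratio; report 0.0.
        odds[x + y] = fxy / (fx * fy) if fx > 0 and fy > 0 else 0.0
    return odds


def composition_distance(seq_a, seq_b):
    """Mean absolute difference between the two dinucleotide odds profiles."""
    odds_a = dinucleotide_odds(seq_a)
    odds_b = dinucleotide_odds(seq_b)
    return sum(abs(odds_a[k] - odds_b[k]) for k in odds_a) / len(odds_a)
```

A ratio well above 1 marks a dinucleotide that is frequent relative to its base composition, and a ratio well below 1 marks a rare one. Calling composition_distance on two sequences returns 0 when their profiles are identical and grows as the profiles diverge.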

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG000335-08
Application #: 2208744
Study Section: Genome Study Section (GNM)

Project Start: 1988-08-01
Project End: 1996-07-31
Budget Start: 1995-08-01
Budget End: 1996-07-31
Support Year: 8
Fiscal Year: 1995

Institution

Name: Stanford University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 800771545

City: Stanford
State: CA
Country: United States
Zip Code: 94305

Publications

Karlin, Samuel; Theriot, Julie; Mrazek, Jan (2004) Comparative analysis of gene expression among low G+C gram-positive genomes. Proc Natl Acad Sci U S A 101:6182-7

Karlin, Samuel; Barnett, Melanie J; Campbell, Allan M et al. (2003) Predicting gene expression levels from codon biases in alpha-proteobacterial genomes. Proc Natl Acad Sci U S A 100:7313-8

Mrazek, Jan; Gaynon, Lisa H; Karlin, Samuel (2002) Frequent oligonucleotide motifs in genomes of three streptococci. Nucleic Acids Res 30:4216-21

Karlin, Samuel; Chen, Chingfer; Gentles, Andrew J et al. (2002) Associations between human disease genes and overlapping gene groups and multiple amino acid runs. Proc Natl Acad Sci U S A 99:17008-13

Ma, Jiong; Campbell, Allan; Karlin, Samuel (2002) Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J Bacteriol 184:5733-45

Karlin, Samuel; Brocchieri, Luciano; Bergman, Aviv et al. (2002) Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci U S A 99:333-8

Chen, Chingfer; Gentles, Andrew J; Jurka, Jerzy et al. (2002) Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22. Proc Natl Acad Sci U S A 99:2930-5

Karlin, Samuel; Brocchieri, Luciano; Trent, Jonathan et al. (2002) Heterogeneity of genome and proteome content in bacteria, archaea, and eukaryotes. Theor Popul Biol 61:367-90

Karlin, S (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol 9:335-43

Brocchieri, L (2001) Phylogenetic inferences from molecular sequences: review and critique. Theor Popul Biol 59:27-40

Showing the most recent 10 out of 74 publications
