The accumulation of molecular sequence data is proceeding at an unprecedented pace. The next phase of molecular biology will be increasingly dominated by efforts to characterize, categorize, and analyze these data, with the goal of understanding molecular sequence information and its significance in biological systems. The investigators' proposal aims at a deeper understanding of genome structure, function, and evolution using empirical, descriptive, and interactive statistical and computational methods. They focus primarily on three interrelated areas:

I. Analysis of codon usage patterns. Detailed knowledge of codon and residue choices can help in gene prediction, in characterizing properties of a given gene, and in defining gene classes. The investigators propose a broad analysis of codon usage biases for individual genes and gene classes in complete prokaryotic and eukaryotic genomes. In particular, their studies will concern codon preferences in different gene classes, including (i) gene classes characterized by function and/or cellular localization; (ii) classes determined by gene size; (iii) codons of a gene divided into three parts: the amino-terminal third, the middle third, and the carboxyl-terminal third; (iv) genes encoded on the leading vs. lagging strand; and (v) classes of horizontally transferred genes characterized with the aid of codon bias extremes.

II. Studies of anomalous genes, including alien genes, highly expressed genes, and those in pathogenicity islands. In complete genomes or in extended contigs, characterizations of alien genes (e.g., laterally transferred genes), of alien gene clusters (e.g., pathogenicity or specialization islands), and of highly expressed genes are of great biological and medical interest.

III. Statistical methods for genome sequence analysis. These will include: (a) characterizations of genomic heterogeneity within and between organisms (e.g., in terms of rare and frequent nucleotides, of motifs, or of compositional biases); (b) extensions of r-scan statistics, which assess anomalies in the distribution of markers along sequences; and (c) statistics of recurrent sequences among genomes, characterized by numbers of repeat families, by their sizes (bp or aa), by spacings between repeats, and by properties of repeat families (intergenic, coding, direct, inverted, mixed).
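
As an illustration of the r-scan idea mentioned under area III, the following is a minimal sketch in Python. It assumes the common definition of an r-scan as the distance spanned by r consecutive inter-marker spacings, that is, the distance from each marker to the marker r positions later. The function names and the example marker positions are hypothetical and are not taken from the proposal. Assessing whether the extremes are statistically significant under a random model is not covered here.

```python
from typing import List, Sequence, Tuple


def r_scan_lengths(positions: Sequence[int], r: int) -> List[int]:
    """Distances from each marker to the marker r positions later.

    Each value equals the sum of r consecutive inter-marker spacings.
    (Hypothetical helper; assumes the definition given in the lead-in.)
    """
    if r < 1:
        raise ValueError("r must be a positive integer")
    ordered = sorted(positions)
    return [ordered[i + r] - ordered[i] for i in range(len(ordered) - r)]


def r_scan_extremes(positions: Sequence[int], r: int) -> Tuple[int, int]:
    """Return the (minimum, maximum) r-scan length.

    An unusually small minimum suggests clustering of markers;
    an unusually large maximum suggests a region depleted of markers.
    """
    lengths = r_scan_lengths(positions, r)
    if not lengths:
        raise ValueError("need more than r markers")
    return min(lengths), max(lengths)


# Made-up marker coordinates (bp) along a sequence, for illustration only.
markers = [120, 150, 160, 900, 2400, 2410, 2425, 5000]
print(r_scan_lengths(markers, 2))   # [40, 750, 2240, 1510, 25, 2590]
print(r_scan_extremes(markers, 2))  # (25, 2590)
```

In this toy example the smallest 2-scan (25 bp) comes from the three markers near position 2400, and the largest (2590 bp) spans the sparse stretch before the final marker.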

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
2R01HG000335-12
Application #
2901693
Study Section
Genome Study Section (GNM)
Program Officer
Brooks, Lisa
Project Start
1988-08-01
Project End
2002-07-31
Budget Start
1999-08-01
Budget End
2000-07-31
Support Year
12
Fiscal Year
1999
Total Cost
Indirect Cost
Name
Stanford University
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
800771545
City
Stanford
State
CA
Country
United States
Zip Code
94305
Karlin, Samuel; Theriot, Julie; Mrazek, Jan (2004) Comparative analysis of gene expression among low G+C gram-positive genomes. Proc Natl Acad Sci U S A 101:6182-7
Karlin, Samuel; Barnett, Melanie J; Campbell, Allan M et al. (2003) Predicting gene expression levels from codon biases in alpha-proteobacterial genomes. Proc Natl Acad Sci U S A 100:7313-8
Mrazek, Jan; Gaynon, Lisa H; Karlin, Samuel (2002) Frequent oligonucleotide motifs in genomes of three streptococci. Nucleic Acids Res 30:4216-21
Karlin, Samuel; Chen, Chingfer; Gentles, Andrew J et al. (2002) Associations between human disease genes and overlapping gene groups and multiple amino acid runs. Proc Natl Acad Sci U S A 99:17008-13
Ma, Jiong; Campbell, Allan; Karlin, Samuel (2002) Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures. J Bacteriol 184:5733-45
Karlin, Samuel; Brocchieri, Luciano; Bergman, Aviv et al. (2002) Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci U S A 99:333-8
Chen, Chingfer; Gentles, Andrew J; Jurka, Jerzy et al. (2002) Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22. Proc Natl Acad Sci U S A 99:2930-5
Karlin, Samuel; Brocchieri, Luciano; Trent, Jonathan et al. (2002) Heterogeneity of genome and proteome content in bacteria, archaea, and eukaryotes. Theor Popul Biol 61:367-90
Karlin, S (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol 9:335-43
Brocchieri, L (2001) Phylogenetic inferences from molecular sequences: review and critique. Theor Popul Biol 59:27-40

Showing the most recent 10 out of 74 publications