Genome sequences within a bacterial species have relatively similar tetranucleotide frequencies. This similarity arises from biases associated with many processes, such as DNA replication and repair systems, DNA restriction enzyme systems, and coding preferences, and from physical constraints such as dinucleotide stacking energies, curvature, and superhelicity of DNA. However, some regions of genomes, such as those encoding ribosomal RNA and ribosomal proteins and those bearing foreign DNA, differ notably in their tetranucleotide frequencies, in the variance of those frequencies, and in G+C content. When comparing the heterogeneity of genome sequences, it is often difficult to determine whether subtle fluctuations of tetranucleotide frequencies in different regions of a genome reflect natural variation within the genome or result from mechanisms of genetic change such as deletions, insertions, transpositions, inversions, duplications, and recombinations of genetic material.
Heterogeneity among and within genomes will be examined by comparing the tetranucleotide frequencies of 3000-bp sections of microbial genomes. Methods have been developed for identifying variable and conserved regions of microbial genomes by constructing composite portraits from the tetranucleotide frequencies of sections of genomic DNA and by computing the variance of those frequencies. The principal investigator has developed a novel approach for analyzing complex, nonlinear data that uses neural computing in concert with cluster analyses. The composition of microbial genomes is being explored by training a back-propagation neural network (NN) to recognize the tetranucleotide frequencies of sections of genomic DNA. This is being accomplished by converting the tetranucleotide frequency data to binary format for input to the NN, and using the NN to relate these data to the G+C content and/or the variance of tetranucleotide frequencies for each genome section.
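The per-section quantities named above (tetranucleotide frequencies, their variance, and G+C content for 3000-bp sections) can be illustrated with a minimal Python sketch. This is not the investigator's implementation. The function names, the choice of non-overlapping sections, the handling of ambiguous bases, and the use of population variance across the 256 frequencies are all assumptions made for illustration.

```python
from itertools import product
from statistics import pvariance

BASES = "ACGT"
# All 4**4 = 256 possible tetranucleotides, in a fixed order.
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]


def sections(seq, size=3000):
    """Yield consecutive, non-overlapping sections (assumption: a short
    trailing remainder is dropped)."""
    for start in range(0, len(seq) - size + 1, size):
        yield seq[start:start + size]


def tetra_frequencies(section):
    """Relative frequency of each tetranucleotide, counted over
    overlapping 4-base windows. Windows containing characters other
    than A, C, G, T are skipped (assumption)."""
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    s = section.upper()
    for i in range(len(s) - 3):
        word = s[i:i + 4]
        if word in counts:
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def gc_content(section):
    """Fraction of unambiguous bases that are G or C."""
    s = section.upper()
    acgt = sum(s.count(b) for b in BASES)
    if acgt == 0:
        return 0.0
    return (s.count("G") + s.count("C")) / acgt


def describe(seq, size=3000):
    """Return one record per section: frequency vector, variance of the
    256 frequencies (population variance, an assumption), and G+C content."""
    rows = []
    for sec in sections(seq, size):
        freqs = tetra_frequencies(sec)
        rows.append({
            "freqs": freqs,
            "variance": pvariance(freqs),
            "gc": gc_content(sec),
        })
    return rows
```

Calling `describe(genome_sequence)` on a sequence string would give one record per 3000-bp section; the frequency vectors are the kind of input that could then be encoded for a neural network or compared directly.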
Once the NN is trained, its output will be used to compare sections of microbial genomes. These analyses will make it possible to determine the true relatedness of genome regions that would otherwise appear similar because of biases associated with DNA replication and repair systems, DNA restriction/modification enzyme systems, and coding preferences. Dendrograms will be produced using Euclidean distance with the average and Ward's linkage methods. The validity of the cluster analysis will be assessed by determining how robust the cluster memberships are to different linkage methods and to independently trained NN data. This information is necessary for understanding how microbial genomes evolve, since bacteria maintain relatively constant tetranucleotide frequencies in their genomes while at the same time changing their genetic composition.
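The clustering step described above can be sketched as agglomerative clustering with Euclidean distance and average linkage. This is a simple pure-Python illustration under stated assumptions, not the project's actual pipeline: the function name `average_linkage` and its return format are hypothetical, the input vectors stand in for per-section NN output, and Ward's linkage (which uses a different merge criterion) is not shown.

```python
from itertools import combinations
from math import dist


def average_linkage(vectors, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Distance between clusters is the mean pairwise Euclidean distance
    between their members (average linkage). Returns the final clusters
    as lists of row indices, plus the merge history (members of each
    merged pair and the distance at which they were joined), which is
    the information a dendrogram displays.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")

    clusters = [[i] for i in range(len(vectors))]
    merges = []

    def avg_dist(a, b):
        total = sum(dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]),
        )
        height = avg_dist(clusters[x], clusters[y])
        merges.append((list(clusters[x]), list(clusters[y]), height))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    return [sorted(c) for c in clusters], merges
```

One way to probe robustness, in the spirit of the text, would be to run the same data through a second linkage method or a second, independently trained set of vectors and check whether the same sections end up grouped together.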

Agency: National Science Foundation (NSF)
Institute: Division of Molecular and Cellular Biosciences (MCB)
Type: Standard Grant (Standard)
Application #: 9802342
Program Officer: Philip Harriman
Budget Start: 1999-01-01
Budget End: 2000-06-30
Fiscal Year: 1998
Total Cost: $28,680
Name: University of South Carolina at Columbia
City: Columbia
State: SC
Country: United States
Zip Code: 29208