With the fraction of known sequences in the Escherichia coli chromosome now exceeding 50 per cent, the goal of comprehensive computer analysis of the bacterial genome is becoming realistic. The scope of this project includes development of an optimal strategy for analysis of the genetic contents of the genome; assessment of the utility of different computer-assisted methods in large-scale genome projects; identification of all genes in the bacterial chromosome; extraction of the maximal amount of information on possible functions and evolutionary relationships of gene products; and delineation of possible regularities in the distribution of related genes in the bacterial chromosome. The 1400 protein sequences contained in the EcoSeq6 database were compared with the complete amino acid sequence databases, with particular emphasis on the relationships among various E. coli proteins. A variety of computer methods for database search, motif identification, and multiple sequence alignment were employed, including newly developed algorithms. As a result, probable functions were predicted for a number of previously uncharacterized putative open reading frame products, and several new protein families and highly conserved, probably functionally important sequence motifs were described. The most interesting findings included a putative new system of regulated, GTP-dependent proteolysis; a family of putative GTP phosphohydrolases related to the antimutator protein MutT, with an apparent GTP-binding motif of a novel type; two previously uncharacterized DNA or RNA helicases belonging to distinct groups within the "DEAD/H" superfamily; and several previously unknown putative methyltransferases. New, unexpected relationships were found for proteins that have been previously characterized functionally, but not structurally; for example, diadenosine tetraphosphate phosphohydrolase (ApaH) was shown to be related to protein phosphatases, and RNase T to DNA proofreading exonucleases. Regions of the E. coli chromosome that have been annotated as untranslated in the EcoSeq6 database were explored using the GENMARK method for coding region prediction and the BLASTX program for database search. As a result, about 100 new genes, encoding putative enzymes, membrane proteins, and regulatory proteins, were predicted to exist in the E. coli chromosome. A strong correlation was established between the results of GENMARK prediction and similarity search, suggesting that coding regions predicted by GENMARK but not showing similarity to sequences available in current databases are still likely to correspond to new genes. The significance of the project lies in the potential for development of an optimal strategy for computer analysis of gene functions and arrangement at the whole-genome scale, and in the prediction of likely functions for many gene products, which should stimulate further experimental dissection.

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000054-01
Application #
3781286
Support Year
1
Fiscal Year
1993
Name
National Library of Medicine
Country
United States