It is expected that within 2-3 years several complete sequences of bacterial genomes will be available. An adequate strategy for computer analysis of these genomes is essential for the successful development of comparative bacterial genomics. In anticipation of this progress, we undertook a detailed analysis of the available protein sequences encoded in the chromosome of Escherichia coli, arguably the best studied bacterium. This work is simultaneously a pilot project for future genome studies and a systematic effort to predict new functions of bacterial proteins and their eukaryotic homologs. A detailed computer analysis of 2,328 protein sequences, comprising about 60% of the E. coli gene products, was performed using methods for database screening with individual sequences and alignment blocks. A surprisingly high fraction of E. coli proteins, about 86%, show significant sequence similarity to other proteins in current databases; about 70% show conservation at least at the level of distantly related bacteria, and about 40% contain regions conserved in eukaryotic or archaeal proteins. For over 90% of the E. coli proteins, functional information, significant sequence similarity, or both are available. About 46% of the E. coli proteins belong to 286 clusters of intraspecies homologs (paralogs), defined as having significant pairwise similarity. Another 10% could be included in clusters using sensitive methods to detect conserved motifs. The majority of the clusters include only two to four members. In contrast, nearly 25% of all E. coli proteins belong to the four largest classes of paralogs. These classes include permeases; ATPases and GTPases with the conserved "Walker-type" motif; helix-turn-helix regulatory proteins; and dinucleotide-binding proteins. Genes encoding paralogous proteins are non-randomly distributed along the chromosome. Sequence similarity with E. coli proteins allowed the prediction of possible functions of a number of important eukaryotic genes, including several whose products are implicated in human diseases. We conclude that bacterial protein sequences are generally highly conserved in evolution, and that with the currently available databases and screening methods, detailed computer analysis yields information on the functions of the vast majority of genes in a bacterial genome. The significance of the project lies in the development of a strategy for comparative analysis of protein sequences at a genome scale.
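
The abstract defines paralog clusters by significant pairwise similarity but does not state how pairs were merged into clusters. As a rough illustration only, the sketch below assumes single-linkage grouping: any two proteins connected by a chain of significant pairwise hits end up in the same cluster. The function name, protein identifiers, scores, and cutoff are all hypothetical and are not taken from the project.

```python
from collections import defaultdict


def cluster_paralogs(pairs, threshold):
    """Group proteins into single-linkage clusters (an assumed procedure).

    pairs: iterable of (protein_a, protein_b, score) tuples.
    threshold: minimum score treated as significant (hypothetical cutoff).
    Returns clusters with two or more members, largest first.
    """
    parent = {}

    def find(x):
        # Register unseen proteins as their own cluster root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)
        # Merge the two clusters only when the hit is significant.
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)

    clusters = [sorted(g) for g in groups.values() if len(g) > 1]
    return sorted(clusters, key=lambda c: (-len(c), c))


# Hypothetical identifiers and similarity scores, for illustration only.
example_pairs = [
    ("P1", "P2", 120.0),
    ("P2", "P3", 95.0),
    ("P4", "P5", 30.0),   # below the cutoff, so P4 and P5 stay unclustered
    ("P6", "P7", 88.0),
]
example_clusters = cluster_paralogs(example_pairs, threshold=80.0)
print(example_clusters)
```

Single linkage is the most permissive choice: one significant hit is enough to join two groups, so a stricter rule (for example, requiring similarity to several members) would yield smaller clusters.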

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000054-03
Application #
5203628
Study Section
Project Start
Project End
Budget Start
Budget End
Support Year
3
Fiscal Year
1995
Total Cost
Indirect Cost
Name
National Library of Medicine
Department
Type
DUNS #
City
State
Country
United States
Zip Code