The fraction of known sequences in the Escherichia coli chromosome has recently crossed the 60% mark, and the complete sequence is expected within a few years. The unannotated regions of the Escherichia coli genomic DNA sequence from the EcoSeq6 database (1,278 "intergenic" sequences with a combined length of 359,279 base pairs) were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method, which is based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) a search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 358 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 206 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 184 ORFs as probable genes was supported by both GeneMark and BLAST, comprising 51.4% of the GeneMark "hits" and 89.1% of the BLAST "hits". Seventy-two putative new genes, or 20.1% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark but have no significant BLAST hits are nevertheless likely to be real genes. The majority of the putative genes identified by the BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far.
Among these new identifications are genes encoding proteins with a variety of predicted functions, including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins. A new family of bacterial and bacteriophage transglycosylases and a new family of rRNA methyltransferases were described.
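
The first element of the strategy, scoring candidate ORFs with Markov chain models for coding and non-coding DNA, can be illustrated with a toy sketch. This is not the GeneMark implementation: it assumes simple first-order, frame-independent chains, scans only the forward strand, and the function names (`find_orfs`, `train_markov`, `log_odds`) and training strings are hypothetical, chosen for illustration only.

```python
import math

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) index pairs of ATG...stop ORFs in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return orfs

def train_markov(seqs, alphabet="ACGT", pseudo=1.0):
    """Estimate first-order transition probabilities with pseudocount smoothing."""
    counts = {a: {b: pseudo for b in alphabet} for a in alphabet}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in alphabet}
        for a in alphabet
    }

def log_odds(seq, coding, noncoding):
    """Sum of log-likelihood ratios (coding vs. non-coding) over adjacent bases."""
    return sum(math.log(coding[a][b] / noncoding[a][b]) for a, b in zip(seq, seq[1:]))

# Toy training data (assumed, not real E. coli sequences).
coding_model = train_markov(["ATGATGATG"])
noncoding_model = train_markov(["AAAAAAAAA"])

for start, end in find_orfs("CCATGAAATTTTAGCC"):
    orf = "CCATGAAATTTTAGCC"[start:end]
    print(start, end, orf, log_odds(orf, coding_model, noncoding_model))
```

A positive log-odds score means the candidate resembles the coding training set more than the non-coding one. In the strategy described above, such candidates would then be checked against protein databases as an independent line of evidence.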

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000054-02
Application #: 3759322
Support Year: 2
Fiscal Year: 1994

Name: National Library of Medicine
Country: United States