The E. coli genome contains more than 3000 genes and is currently 40% sequenced. A complete high-resolution restriction map of the entire genome is available. This makes the E. coli genome project the most advanced of all cellular genome projects. The sequence and map information has been collected and organized into a cohesive information base, unifying the efforts of many laboratories into a single data resource. The project includes software development, database development, and data analysis. The software developed or enhanced during the reporting period includes programs that assemble restriction map and DNA sequence data into a single DNA sequence file, termed BigSeq; improved graphical representations of genomic map and sequence data; and SiteFinder, a program that finds inexact pattern matches in DNA sequences. Two relational databases are being developed: GeneScape, a Macintosh database of genomic map information that is essentially complete, and EC-BASE, a Sybase database of E. coli map and DNA sequence information. DNA sequence and genomic restriction map data have been analyzed to determine the information content of ribosome binding sites, the number and distribution of genomic restriction sites, repeated patterns in DNA sequences, the distribution and categorization of proteins encoded in the genome, the assignment of genes to individual clones in the ordered clone set of the E. coli genome, and the presence of putative new genes in the DNA sequence flanking known genes. In addition, the gaps that remain to be sequenced in the E. coli genome have been analyzed. This analysis revealed an unexpectedly large number of gaps shorter than 2000 bp, which suggests that a directed, PCR-based gap-closing strategy for completing the genomic sequence should be seriously considered.
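
The gap analysis described above can be illustrated with a minimal sketch. It is not the project's actual software: the function names, the half-open coordinate convention, and the toy intervals are all assumptions made for illustration. For simplicity it treats the genome as linear, although the E. coli chromosome is circular. Given the map coordinates of sequenced regions, it merges overlapping regions, lists the unsequenced gaps between them, and selects gaps shorter than a threshold (2000 bp here) as candidates for PCR-based closure.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) coordinates in bp.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching sequenced regions."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def find_gaps(intervals: List[Interval], genome_length: int) -> List[Interval]:
    """Return unsequenced gaps, treating the genome as linear (a simplification)."""
    gaps: List[Interval] = []
    cursor = 0
    for start, end in merge_intervals(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        gaps.append((cursor, genome_length))
    return gaps


def short_gaps(gaps: List[Interval], max_len: int = 2000) -> List[Interval]:
    """Select gaps shorter than max_len bp."""
    return [(s, e) for s, e in gaps if e - s < max_len]


# Toy data, invented for illustration only (not real E. coli coordinates).
sequenced = [(0, 5000), (4000, 9000), (9500, 20000), (26000, 30000)]
gaps = find_gaps(sequenced, genome_length=31000)
candidates = short_gaps(gaps)
print(gaps)
print(candidates)
```

On real data, the sequenced intervals would come from the positions of database entries on the genomic restriction map, and a circular genome would additionally require joining a gap that spans the coordinate origin.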