Intellectual Merit: The identification of coding regions, or genes, in the genome of an organism is generally unreliable when it is based primarily on DNA sequence and lacks experimental verification. Retrospective analysis of a genome annotation with proteomics data improves its quality and completeness. Unfortunately, proteogenomically improved gene models are rarely incorporated into the public knowledge base. This project seeks to build a proteogenomics software pipeline that will enable and improve primary genome annotation. The pipeline will initially be used to re-annotate 30 prokaryotic genomes from six representative phyla: Euryarchaeota, Cyanobacteria, Actinobacteria, Firmicutes, Deinococcus-Thermus and Proteobacteria. It is anticipated that tens of thousands of gene models can be validated or protein maturation events verified. As the target genomes are improved, the corrections can be propagated to an estimated 300 additional, highly homologous genomes. To ensure broad public accessibility, all findings will be incorporated into RefSeq and GenBank.
Broader Impact: The project will facilitate education and training through the development of a proteogenomics curriculum for use in university bioinformatics and genomics courses and in science workshops. Additionally, two high school teachers will be mentored in curriculum development.