This project will support the continued development and maintenance of four bioinformatics software systems that are widely used in research on gene finding and genome annotation. The first of these, Glimmer, is used to find genes in bacteria, viruses, archaea, and simple eukaryotes. Glimmer is highly accurate, finding over 99% of the genes in most bacteria. It has been used by thousands of scientists around the world, including in the majority of published bacterial and archaeal genome sequencing projects over the past decade. Collectively, the three main publications describing Glimmer have been cited over 2,600 times, including 400 citations in 2012 alone. Usage of Glimmer has increased in recent years due to the explosion in next-generation sequencing projects, which are particularly cost-effective for bacterial genomes. We recently introduced a new version of Glimmer customized for metagenomics data, which is intended to make it available to microbiome researchers. Glimmer's algorithm is also the basis of PhymmBL, a new system for classifying sequences from metagenomics projects, which we will also support under this project. The second system, MUMmer, is a highly efficient system for whole-genome alignment that is widely used to compare bacterial genomes to one another and to compare genome assemblies to detect changes, both large and small. MUMmer and its components, especially Nucmer, have been widely used and have been incorporated into many other systems, including a recent multi-genome aligner, Mugsy, and several genome assembly packages. The three main publications describing MUMmer have been cited over 1,900 times, including 200 citations in 2012. A major reason for the recent increase in usage of these systems, beyond the drop in sequencing costs, is the growth of metagenomics research, particularly the Human Microbiome Project. This project will also support two other systems, TransTermHP and OperonDB, and the web databases that accompany them.
TransTermHP finds transcription terminators in bacterial and archaeal genomes, and we have used it to build a website containing predictions for over 1,500 genomes, all of which are freely downloadable. OperonDB includes a database and a software system that identifies operons in a collection of prokaryotic genomes using conserved synteny across species. Each of these systems has been widely used and cited, and this project requests funding to rebuild the databases on a larger collection of genomes and to continue to expand them as more genomes appear. All of the software and data generated by this project will continue to be freely available under an open source license, allowing other researchers to use, modify, and redistribute them without restrictions of any kind.

Public Health Relevance

This project supports a suite of software packages that have been extensively used in the interpretation and analysis of many pathogenic organisms, including the bacteria that cause tuberculosis, cholera, anthrax, strep and staph infections, Lyme disease, syphilis, and many others. Ongoing support and development of this software will be essential for continuing research on these diseases, and also for meeting the new challenges likely to emerge from efforts to sequence the diverse bacteria that live in the human body.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Lyster, Peter
Johns Hopkins University
Schools of Medicine
United States
Li, Zhigang; Breitwieser, Florian P; Lu, Jennifer et al. (2018) Identifying Corneal Infections in Formalin-Fixed Specimens Using Next Generation Sequencing. Invest Ophthalmol Vis Sci 59:280-288
Pertea, Mihaela; Shumate, Alaina; Pertea, Geo et al. (2018) CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise. Genome Biol 19:208
Breitwieser, F P; Baker, D N; Salzberg, S L (2018) KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol 19:198
Lu, Jennifer; Salzberg, Steven L (2018) Removing contaminants from databases of draft genomes. PLoS Comput Biol 14:e1006277
Luo, Ruibang; Zimin, Aleksey; Workman, Rachael et al. (2017) First Draft Genome Sequence of the Pathogenic Fungus Lomentospora prolificans (Formerly Scedosporium prolificans). G3 (Bethesda) 7:3831-3836
Zimin, Aleksey V; Stevens, Kristian A; Crepeau, Marc W et al. (2017) An improved assembly of the loblolly pine mega-genome using long-read single-molecule sequencing. Gigascience 6:1-4
Pertea, Mihaela; Kim, Daehwan; Pertea, Geo M et al. (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11:1650-67
Kim, Daehwan; Song, Li; Breitwieser, Florian P et al. (2016) Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res 26:1721-1729
Breitwieser, Florian P; Pardo, Carlos A; Salzberg, Steven L (2015) Re-analysis of metagenomic sequences from acute flaccid myelitis patients reveals alternatives to enterovirus D68 infection. F1000Res 4:180
Pop, Mihai; Salzberg, Steven L (2015) Use and mis-use of supplementary material in science publications. BMC Bioinformatics 16:237
