This project will support the continued development and maintenance of four bioinformatics software systems that are widely used in research on gene finding and genome annotation. The first of these, Glimmer, is used to find genes in bacteria, viruses, archaea, and simple eukaryotes. Glimmer is highly accurate, finding over 99% of the genes in most bacteria. It has been used by thousands of scientists around the world and in the majority of published bacterial and archaeal genome sequencing projects over the past decade. Collectively, the three main publications describing Glimmer have been cited over 2,600 times, including 400 citations in 2012 alone. Usage of Glimmer has increased in recent years due to the explosion in next-generation sequencing projects, which are particularly cost-effective for bacterial genomes. Our very recent introduction of a new version of Glimmer customized for metagenomics data is intended to make it available to microbiome researchers. Glimmer's algorithm is also the basis of PhymmBL, a new system for classifying sequences from metagenomics projects, which we will also support under this project. The second system, MUMmer, is a highly efficient system for whole-genome alignment that is widely used to compare bacterial genomes to one another and to compare genome assemblies to detect changes, both large and small. MUMmer and its components, especially Nucmer, have been widely used and have been incorporated in many other systems, including a recent multi-genome aligner, Mugsy, and several genome assembly packages. The three main publications describing MUMmer have been cited over 1,900 times, including 200 citations in 2012. A major reason for the recent increase in usage of these systems, beyond the drop in sequencing costs, is the growth of metagenomics research, particularly the Human Microbiome Project. This project will also support two other systems, TransTermHP and OperonDB, and the web databases that accompany them.
TransTermHP finds transcription terminators in bacterial and archaeal genomes, and we have used it to build a website containing predictions for over 1,500 genomes, all of which are freely downloadable. OperonDB includes a database and a software system that identifies operons in a collection of prokaryotic genomes using conserved synteny across species. Each of these systems has been widely used and cited, and this project requests funding to rebuild the databases on a larger collection of genomes and to continue to expand them as more genomes appear. All of the software and data generated by this project will continue to be freely available under an open source license, allowing other researchers to use, modify, and redistribute them without restrictions of any kind.

Public Health Relevance

This project supports a suite of software packages that have been extensively used in the analysis and interpretation of the genomes of many pathogenic organisms, including the bacteria that cause tuberculosis, cholera, anthrax, strep and staph infections, Lyme disease, syphilis, and many others. Ongoing support and development of this software will be essential to continued research on these diseases, and to meeting the new challenges likely to emerge from efforts to sequence the diverse bacteria that live in the human body.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Lyster, Peter
Johns Hopkins University
Schools of Medicine
United States
Pertea, Mihaela; Kim, Daehwan; Pertea, Geo M et al. (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11:1650-67
Kim, Daehwan; Song, Li; Breitwieser, Florian P et al. (2016) Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res 26:1721-1729
Breitwieser, Florian P; Pardo, Carlos A; Salzberg, Steven L (2015) Re-analysis of metagenomic sequences from acute flaccid myelitis patients reveals alternatives to enterovirus D68 infection. F1000Res 4:180
Pop, Mihai; Salzberg, Steven L (2015) Use and mis-use of supplementary material in science publications. BMC Bioinformatics 16:237
Martinson, Vincent G; Magoc, Tanja; Koch, Hauke et al. (2014) Genomic features of a bumble bee symbiont reflect its host environment. Appl Environ Microbiol 80:3793-803
Salzberg, Steven L; Pertea, Mihaela; Fahrner, Jill A et al. (2014) DIAMUND: direct comparison of genomes to detect mutations. Hum Mutat 35:283-8
Merchant, Samier; Wood, Derrick E; Salzberg, Steven L (2014) Unexpected cross-species contamination in genome sequencing projects. PeerJ 2:e675
Wood, Derrick E; Salzberg, Steven L (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15:R46
Magoc, Tanja; Wood, Derrick; Salzberg, Steven L (2013) EDGE-pro: Estimated Degree of Gene Expression in Prokaryotic Genomes. Evol Bioinform Online 9:127-36
Magoc, Tanja; Pabinger, Stephan; Canzar, Stefan et al. (2013) GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics 29:1718-25

Showing the most recent 10 out of 68 publications