This project will support the continued development and maintenance of four bioinformatics software systems that are widely used in research on gene finding and genome annotation. The first of these, Glimmer, is used to find genes in bacteria, viruses, archaea, and simple eukaryotes. Glimmer is highly accurate, finding over 99% of the genes in most bacteria. It has been used by thousands of scientists around the world and in the majority of published bacterial and archaeal genome sequencing projects over the past decade. Collectively, the three main publications describing Glimmer have been cited over 2,600 times, including 400 citations in 2012 alone. Usage of Glimmer has increased in recent years due to the explosion in next-generation sequencing projects, which are particularly cost-effective for bacterial genomes. Our very recent introduction of a new version of Glimmer customized for metagenomics data is intended to make it available to microbiome researchers. Glimmer's algorithm is also the basis of PhymmBL, a new system for classifying sequences from metagenomics projects, which we will also support under this project. The second system, MUMmer, is a highly efficient package for whole-genome alignment that is widely used to compare bacterial genomes to one another and to compare genome assemblies to detect changes, both large and small. MUMmer and its components, especially Nucmer, have been incorporated in many other systems, including a recent multi-genome aligner, Mugsy, and several genome assembly packages. The three main publications describing MUMmer have been cited over 1,900 times, including 200 citations in 2012. A major reason for the recent increase in usage of these systems, beyond the drop in sequencing costs, is the growth of metagenomics research, particularly the Human Microbiome Project. This project will also support two other systems, TransTermHP and OperonDB, and the web databases that accompany them.
TransTermHP finds transcription terminators in bacterial and archaeal genomes, and we have used it to build a website containing predictions for over 1,500 genomes, all of which are freely downloadable. OperonDB includes a database and a software system that identifies operons in a collection of prokaryotic genomes using conserved synteny across species. Each of these systems has been widely used and cited, and this project requests funding to rebuild the databases on a larger collection of genomes and to continue to expand them as more genomes appear. All of the software and data generated by this project will continue to be freely available under an open source license, allowing other researchers to use, modify, and redistribute them without restriction.

Public Health Relevance

This project supports a suite of software packages that have been extensively used in the analysis and interpretation of the genomes of many pathogenic organisms, including the bacteria that cause tuberculosis, cholera, anthrax, strep and staph infections, Lyme disease, syphilis, and many others. Ongoing support and development of this software will be essential for continuing research on these diseases and for meeting the new challenges likely to emerge from efforts to sequence the diverse bacteria that live in the human body.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Lyster, Peter
Johns Hopkins University
Schools of Medicine
United States
Salzberg, Steven L; Pertea, Mihaela; Fahrner, Jill A et al. (2014) DIAMUND: direct comparison of genomes to detect mutations. Hum Mutat 35:283-8
Wood, Derrick E; Salzberg, Steven L (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15:R46
Schatz, Michael C; Phillippy, Adam M; Sommer, Daniel D et al. (2013) Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. Brief Bioinform 14:213-24
Salzberg, Steven L; Phillippy, Adam M; Zimin, Aleksey et al. (2012) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 22:557-67
Langmead, Ben; Salzberg, Steven L (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357-9
Kelley, David R; Liu, Bo; Delcher, Arthur L et al. (2012) Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res 40:e9
Angiuoli, Samuel V; Dunning Hotopp, Julie C; Salzberg, Steven L et al. (2011) Improving pan-genome annotation using whole genome multiple alignment. BMC Bioinformatics 12:272
Magoc, Tanja; Salzberg, Steven L (2011) FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27:2957-63
Angiuoli, Samuel V; Salzberg, Steven L (2011) Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics 27:334-42
Schatz, Michael C; Delcher, Arthur L; Salzberg, Steven L (2010) Assembly of large genomes using second-generation sequencing. Genome Res 20:1165-73

Showing the most recent 10 out of 33 publications