This project will support our work on computational methods for microbial sequence analysis, including gene finding, whole-genome alignment, genome assembly, and metagenomic sequence analysis. Over the years we have developed multiple systems to solve problems in these areas, some of which are very widely used. These tools need continued updates and improvements to keep pace with changes in sequencing technology, changes in experimental design, and the ever-growing number of sequenced genomes. One of these systems is Glimmer, a computational method for finding genes in bacteria, viruses, archaea, and simple eukaryotes. Glimmer is highly accurate, finding over 99% of the genes in most prokaryotic genomes. It has been used by thousands of scientists around the world and in the majority of published bacterial genome sequencing projects over the past decade. Collectively, the three main publications describing Glimmer have been cited over 4,700 times, including >700 citations in 2016-17 alone. Usage of Glimmer has increased in recent years due to the explosion in next-generation sequencing projects, which are particularly cost-effective for bacterial genomes. A second system, MUMmer, is an efficient whole-genome aligner that is used to compare genomes to one another and to compare genome assemblies to detect changes, both large and small. MUMmer and its components, especially Nucmer, have been widely used and incorporated in other systems, including multi-genome aligners and several genome assembly packages. The three main publications describing MUMmer have been cited over 3,600 times, including >750 citations in 2016-17. In recent years we have focused our efforts on developing methods for the analysis of metagenomics data, producing several newer tools, including Kraken and Centrifuge. Both of these systems attempt to assign a species identifier to every read in a metagenomics data set.
Because the Kraken algorithm is not only accurate but far faster than earlier methods, it was rapidly adopted by many labs soon after its release, and its usage continues to grow. The even newer and more space-efficient Centrifuge system has also been highly successful and was recently incorporated into the analysis package of one of the new third-generation sequencing companies. We continue to work on improving the performance of both algorithms, and this project will allow us to extend them to handle the newest long-read data that is increasingly being used for metagenomics experiments. Finally, a new direction of the lab is the use of metagenomic shotgun sequencing to diagnose infections, for which we are not only modifying our algorithms, but also building customized genome databases where we rigorously screen the genomes to identify and remove contaminants and low-complexity sequences that create false positives. As we have done for many years, we will release all of the software and data generated by this project for free under an open source license, allowing other scientists to use, modify, and redistribute them without restrictions of any kind.
This project supports a suite of software packages that are very widely used in the interpretation and analysis of the genomes of many pathogenic organisms, including the bacteria that cause tuberculosis, cholera, anthrax, strep and staph infections, Lyme disease, syphilis, and many others. It also supports the development of new methods that address the many new challenges emerging from the Human Microbiome Project and from other metagenomics efforts to sequence the amazingly diverse microorganisms that live in the environment around us.