Bacteria are the most abundant organisms on Earth, yet little is known about most members of this domain of life. Only about 1% of bacterial species can be easily grown in culture, and considerably fewer have been sequenced. Advances in sequencing technologies have made it possible to sequence bacteria directly from the environment, providing a dramatic new outlook on the diversity of bacteria populating our world. Initial studies have explored the bacteria present in mines, ocean water, and soil, as well as communities of commensal microbes that inhabit the human body. The latter have provided a glimpse at the complex symbiotic relationships between bacteria and their human hosts. Despite an increased interest in environmental sequencing (metagenomics), few specialized computational algorithms exist for the analysis of such data. For example, the assembly of environmental data is being performed with software originally intended for homogeneous DNA sources, such as clonal bacterial populations or inbred eukaryotes. These programs are ill-suited to the assembly of heterogeneous microbial communities, and numerous "hacks" have been necessary to produce the assemblies published to date. This proposal aims to fill the need for specialized software for assembling and finding genes in metagenomic datasets. A particular focus will be on developing tools for uncovering genomic variation within the assemblies of microbial communities. The proposed software will specifically address issues arising from the use of new sequencing technologies in metagenomic projects. The low cost and high throughput of these technologies will allow a far deeper exploration of the microbial biosphere than was previously possible. Their broad application, however, depends on the availability of software systems adapted to their specific characteristics.
In addition, new algorithms will be developed to allow the individual components of a metagenomic analysis pipeline to be tightly integrated, with the goal of improving the overall quality of both assembly and annotation and of facilitating the extraction of other types of information from large sets of metagenomic data. The proposal further aims to investigate the impact of experimental design and choice of sequencing technology on the ability to assemble and analyze metagenomic data, through the development of software for simulating bacterial populations and emulating a variety of sequencing strategies. Better experimental design can reduce the high costs currently associated with environmental sequencing and enhance subsequent analyses. All software developed as part of this proposal, as well as any simulated data and results of reanalyzing public datasets, will be released freely through public databases and open-source software repositories.

Public Health Relevance

Project Narrative

Initial explorations of the communities of bacteria that inhabit our bodies have already provided insights into the complex relationships between microbes and the human host, as well as the contribution of bacteria to diseases such as obesity and inflammatory bowel disease. Many more studies will be needed to help us fully understand the complex human-microbe interactions and to translate these discoveries into new therapies. The current proposal provides scientists with components of the software infrastructure that will be essential for genomic studies of the human microbiome.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1-BST-F (50))
Program Officer
Bonazzi, Vivien
University of Maryland College Park
Biostatistics & Other Math Sci
Schools of Arts and Sciences
College Park
United States
Davison, Michelle; Treangen, Todd J; Koren, Sergey et al. (2016) Diversity in a Polymicrobial Community Revealed by Analysis of Viromes, Endolysins and CRISPR Spacers. PLoS One 11:e0160574
Pop, Mihai; Walker, Alan W; Paulson, Joseph et al. (2014) Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition. Genome Biol 15:R76
Paulson, Joseph N; Stine, O Colin; Bravo, Héctor Corrada et al. (2013) Differential abundance analysis for microbial marker-gene surveys. Nat Methods 10:1200-2
Treangen, Todd J; Koren, Sergey; Sommer, Daniel D et al. (2013) MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol 14:R2
Ghodsi, Mohammadreza; Hill, Christopher M; Astrovskaya, Irina et al. (2013) De novo likelihood-based measures for comparing genome assemblies. BMC Res Notes 6:334
Schatz, Michael C; Phillippy, Adam M; Sommer, Daniel D et al. (2013) Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. Brief Bioinform 14:213-24
Gevers, Dirk; Pop, Mihai; Schloss, Patrick D et al. (2012) Bioinformatics for the Human Microbiome Project. PLoS Comput Biol 8:e1002779
Human Microbiome Project Consortium (2012) Structure, function and diversity of the healthy human microbiome. Nature 486:207-14
Salzberg, Steven L; Phillippy, Adam M; Zimin, Aleksey et al. (2012) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 22:557-67
Liu, Bo; Faller, Lina L; Klitgord, Niels et al. (2012) Deep sequencing of the oral microbiome reveals signatures of periodontal disease. PLoS One 7:e37919

Showing the most recent 10 out of 23 publications