Over the last decade, bacterial and archaeal communities have been identified in virtually every ecological niche examined, from acid mine drainage to tropospheric clouds to the human body. Understanding the make-up of, and interactions within, these communities is crucial to understanding how each ecosystem functions. Metagenomics (or community genomics) is the approach whereby the collective genomic content of a naturally occurring microbial community is determined without the need to isolate and culture its constituents independently. When assembled, these data yield the genetic framework of the community, including critical information on population structure and phylogenetic diversity as well as novel genetic and biochemical activities. The potential impact of such findings on biotechnology, medicine and ecology is enormous. Recently, next-gen sequencing technologies (e.g. Roche/454, Illumina, and Life Technologies (SOLiD)) have replaced traditional Sanger sequencing for metagenomic data generation. These are cost-effective, clone-free, massively parallel technologies capable of producing as many as 250 gigabases of data in a single machine run. That level of sequencing makes it feasible to reconstruct even low-abundance genomes and to determine intraspecies heterogeneity from a community sample. However, the next-gen technologies also present their own computational challenges, including the sheer volume of data to be processed and the different read lengths, error models and formats unique to each technology. These complexities, together with those posed by the communities themselves, have left researchers to cobble together various combinations of software tools to process and analyze their data. Most software that can handle these large, complex data sets also requires substantial computing resources and computer expertise beyond those of a normally equipped lab.
These difficulties continue to have a serious stifling effect on the advances in science and technology that metagenomics offers. The long-term goal of this proposal is to develop a seamless, commercial-grade metagenomic sequence assembly and analysis pipeline that is fully scalable to any size project. The proposed software will be easy to use and will run on a desktop computer costing less than $5,000, so that any reasonably funded laboratory or clinic can exploit metagenomic technology. Toward that goal, this Phase I proposal focuses on a solution to the central task of processing massive next-gen metagenomic data sets on a desktop computer. We will evaluate whether our new non-memory-bound assembly engine, XNG, can meet the challenges in three crucial steps: 1) removing reads derived from contaminating host DNA, 2) "recruiting" the potentially hundreds of millions of remaining reads into appropriate phylogenetic bins based on matches to a local reference genome database, and 3) converting genome sequences from multiple strains of a given species into a single annotated entry (the "pan-genome") for enhanced read recruitment and downstream annotation.

Public Health Relevance

Human health and medicine are greatly influenced by the microbial communities that are integral parts of our bodies, as well as by those that produce, for example, useful antibiotics. To better exploit the potential of these communities, metagenomic studies are producing vast amounts of next-gen DNA sequence data to decipher their collective genetic content. This project focuses on developing computer software capable of reconstructing microbial community genomes and analyzing their content.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Small Business Innovation Research Grants (SBIR) - Phase I (R43)
Study Section: Special Emphasis Panel (ZRG1-IMST-K (14))
Program Officer: Proctor, Lita
Dnastar, Inc.
United States