Next-generation sequencing technology has enabled the rapid sequencing of mixed genomes sampled directly from the environment, an approach that has recently emerged as the field of metagenomics. By direct sequencing, researchers can study organisms that are difficult or impossible to culture in the laboratory. Given the sequence data from a metagenomic sample, the basic questions to be addressed include "what species or genomes are there?", "what are their relative abundances?", and "how many more species will be detected if more sequence reads are obtained?" The investigators propose to incorporate computational and statistical thinking into modeling metagenomic sequencing data and estimating the multiple genomes, and their relative abundances, within a metagenomic sample. In particular, the investigator and her colleagues propose to combine a clustering method with a mixture modeling framework to estimate the genomes present and their relative abundances, based on the hits obtained by aligning the sequence reads to known reference sequences. This mixture modeling framework can be further extended to include fine-tuning parameters such as position-specific sequencing errors. Conventional statistical and computational methods and algorithms for computing the point estimate and confidence interval of the species richness in a sample will be evaluated on metagenomic data, and new methods will be developed if necessary.
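As an illustration of what a conventional species richness point estimate looks like, the following is a minimal sketch of the bias-corrected Chao1 estimator, a widely used nonparametric estimator that extrapolates from the number of species seen exactly once and exactly twice. It is not taken from the proposal, which does not name the estimators to be evaluated; the function name `chao1` and the example read counts are hypothetical, and the sketch covers only the point estimate, not the confidence interval.

```python
from collections import Counter


def chao1(abundances):
    """Bias-corrected Chao1 point estimate of total species richness.

    abundances: per-species read counts observed in one sample
    (hypothetical input format; zeros are ignored).
    """
    counts = [c for c in abundances if c > 0]
    s_obs = len(counts)       # number of species actually observed
    freq = Counter(counts)
    f1 = freq[1]              # singletons: species seen exactly once
    f2 = freq[2]              # doubletons: species seen exactly twice
    # The bias-corrected form stays defined even when f2 == 0.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Made-up counts: 9 observed species, 4 singletons, 2 doubletons.
read_counts = [120, 45, 9, 2, 2, 1, 1, 1, 1, 0]
print(chao1(read_counts))  # 9 + (4 * 3) / (2 * 3) = 11.0
```

The gap between the estimate and the observed count (here 11 versus 9) is one rough answer to "how many more species will be detected if more sequence reads are obtained?", under the assumptions of the estimator.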

Metagenomics provides a tool for studying genetic material recovered directly from a natural community (such as soil or seawater) or a host-associated community (such as the human gut). Identifying the multiple genomes in a single data set is a challenging problem, particularly when the species are present at vastly different abundances. The algorithms and methods developed in this proposal can be applied to metagenomic studies in different fields, including human health, the environment, agriculture, and the identification of viruses in biological threats and infectious diseases. The statistical models and computational algorithms will be integrated into open-source R software and made publicly available so that other researchers can analyze their own metagenomic data. A post-doctoral fellow will be trained in this project.

Project Report

Next-generation sequencing technology has greatly advanced the field of metagenomics, which studies multiple genomes from an environment without culturing the individual organisms. By direct sequencing, researchers can study organisms that are difficult or impossible to culture in the laboratory. Metagenomics has been applied in different fields, including human health, agriculture, and the identification of viruses in biological threats and infectious diseases. Characterizing the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. Based on the sequence data from a metagenomic sample, we proposed a mixture model to identify the multiple genomes contained in the sample and to estimate their relative abundances. The proposed method was tested comprehensively on both simulated and real datasets. It assigns reads to low taxonomic ranks with high accuracy. The results have been published. The open-source R code and the R package TAMER, which implement our statistical approach to the taxonomic assignment of metagenomic reads, are publicly available online so that other researchers can analyze their own metagenomic data.

This award resulted in more than 10 publications. The PIs, post-doctoral fellows, and the graduate student gave over 40 oral and poster presentations on this research at seminars, colloquia, and national professional conferences. The post-doctoral fellows and the graduate student supported and trained through this award also participated in responsible conduct of research (RCR) training. In addition to attending department seminars and other seminars at their own universities, the postdocs and the graduate student were supported to attend and present at national professional conferences and NSF workshops.
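To make the idea of estimating relative abundances with a mixture model more concrete, here is a minimal sketch of a generic expectation-maximization (EM) loop for mixing proportions when each read has a fixed likelihood under each candidate genome. It is not the TAMER implementation and makes no claim about that package's API; the function name `estimate_abundances`, the likelihood matrix format, and the toy data are all assumptions made for illustration.

```python
def estimate_abundances(likelihoods, n_iter=200, tol=1e-10):
    """Estimate mixing proportions of genomes with a plain EM loop.

    likelihoods[i][j] is an assumed, precomputed likelihood of read i
    under genome j (for example, derived from alignment hits); 0 means
    read i did not hit genome j. Hypothetical input format.
    """
    n_genomes = len(likelihoods[0])
    pi = [1.0 / n_genomes] * n_genomes  # start from uniform abundances
    for _ in range(n_iter):
        totals = [0.0] * n_genomes
        for row in likelihoods:
            # E-step: posterior probability that this read came from each genome.
            weighted = [p * lik for p, lik in zip(pi, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is unsupported under current proportions; skip it
            for j, w in enumerate(weighted):
                totals[j] += w / norm
        assigned = sum(totals)
        if assigned == 0.0:
            raise ValueError("no read has a nonzero likelihood under any genome")
        # M-step: new proportions are the average posterior memberships.
        new_pi = [t / assigned for t in totals]
        change = max(abs(a - b) for a, b in zip(new_pi, pi))
        pi = new_pi
        if change < tol:
            break
    return pi


# Toy data: 6 reads hit only genome A, 2 hit only genome B,
# and 2 hit both equally well (ambiguous reads).
hits = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 2 + [[1.0, 1.0]] * 2
print(estimate_abundances(hits))  # converges toward roughly [0.75, 0.25]
```

In this toy case the ambiguous reads are split in proportion to the current abundance estimates, so the unique hits pull the estimate toward 0.75 and 0.25 rather than the naive 0.7 and 0.3 obtained by splitting ambiguous reads evenly. A real analysis would also need the clustering step and the error modeling described in the abstract, which this sketch omits.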

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1043080
Program Officer: Leland Jameson
Project Start:
Project End:
Budget Start: 2010-10-01
Budget End: 2014-09-30
Support Year:
Fiscal Year: 2010
Total Cost: $725,009
Indirect Cost:
Name: Northwestern University at Chicago
Department:
Type:
DUNS #:
City: Chicago
State: IL
Country: United States
Zip Code: 60611