The project aims to improve microbial classification and comparison from next-generation sequencing data. The first objective is to improve microbial identification from short reads using a Bayesian classifier. Although the classifier is fast, it can depend on a very large feature set, because it uses fixed-length DNA words, and long words often have many zero frequencies. The investigators propose to compensate for these zeros with zero-inflated negative binomial and Poisson models instead of the linguistic "smoothing" techniques traditionally used, which are ad hoc at best. The second objective is to reduce the feature size for whole-genome analysis. Long DNA words produce an enormous feature space; using random manifolds, compressive sensing, and other techniques, the investigators propose to reduce that space while retaining microbial classification accuracy. The third objective is to model and fit functions to microbial population changes along a gradient (a changing environmental factor), especially when many of the data points are missing. This final objective will allow biologists and ecologists to correlate microbial composition (from the first two objectives) with environmental factors and to model microbial changes, and thus to improve future threat detection.
The investigators are developing mathematical methods to model how an environment is uniquely identified by its microbial community. Because a chemical's presence can be inferred from the microbial community rather than measured directly, the project's results will enable advances in biotechnology for trace chemical detection and forensics. One example is modeling how soil microbial communities change in response to buried explosives, in order to improve detection of these devices and protect our troops.