This research focuses on reducing the threat of bioterrorism by integrating tools from genomics and statistics in ways that have not been previously examined. The investigators develop novel statistical theory and computational tools for accurate pathogen detection based on next-generation sequencing data. Key research directions involve (i) classification by sequence enrichment; (ii) comparison of empirical clusterings and reference genomes; and (iii) shrinkage estimation and model selection in hierarchical log-linear models. In addition to an in-depth characterization of the theoretical properties of these new statistical inference techniques, the investigators perform a thorough assessment of their practical importance in the context of the detection and identification of bacterial genomes. This assessment uses publicly available data from sources such as the Human Microbiome Project, the NCBI Short Read Archive, the European Bioinformatics Institute, and the Broad Institute. The new methodology applies broadly to high-dimensional settings in which choosing an appropriate class of candidate statistical models is difficult. The investigators study statistical ensembles: combinations of techniques that have been shown to provide more reliable inferences than any single statistical approach. Unlike existing work, which combines models from the same class, this new framework concerns ensembles that cross class boundaries and optimally combine inferences from multiple models drawn from several model classes. These ensembles are expected to have distinct advantages over existing approaches, such as robustness to model misspecification and improved predictive performance.
The new statistical methodology developed in this proposal has the potential to substantially improve the response of federal and international agencies to a bioterrorism attack through rapid identification of differences in microbial genomes and their accurate classification as harmless or potentially pathogenic. The impact of these pathogen detection algorithms on both information technology and civil infrastructure is maximized by implementing them in user-friendly, open-source computational tools and software that will be freely available to the public. The project also has a significant educational and mentorship component for students and postdoctoral fellows who are interested in enhancing our ability to respond rapidly and appropriately to (i) incidents of bioterrorism and (ii) microbial threats to public health.