Evaluating biological data often requires statistical insight to determine whether apparent treatment effects are real and useful. Typical applications include personalized medicine and drug resistance. Standard statistical methods rely on an unrealistic assumption, namely that the data are independent and identically distributed. This project will provide biologists with new tools for detecting, quantifying and leveraging hierarchical dependencies in areas of microbiology that are currently being revolutionized by new sequencing technologies. The PIs propose to tailor new "treeness" and "clustering" indices that incorporate relevant distance and structural information computed from sequence and contingent information. The investigators will also use the treeness indices to develop multiple testing programs that improve the power of corrected multiple testing procedures when there are hierarchical dependencies between variables. This will enhance the power to detect significant functional differences between conditions. The methods will first be developed and calibrated on data simulated according to known tree structures. Calibration will evaluate the indices under various types of perturbation and thresholding. The methods will then be applied to real data generated, as part of the proposed work, by directed evolution experiments in microbial ecology.
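
For readers unfamiliar with corrected multiple testing, the following is a minimal sketch of one widely used baseline, the Benjamini-Hochberg step-up procedure. It is not the method proposed in this project: it makes no use of tree structure, and its false discovery rate guarantee is established for independent tests and for certain forms of positive dependence. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions, not taken from the award.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (illustrative baseline only).

    Returns a list of booleans, True where the corresponding hypothesis
    is rejected at false discovery rate level alpha.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. one test per taxon or gene.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

Because the procedure treats all hypotheses as exchangeable, it ignores any hierarchical relationship between the variables being tested, which is the kind of structure the project intends to exploit.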

This is an application-driven project that aims to provide useful multiple testing corrections under hierarchical dependencies. The goal is to tailor statistical methods to the exact needs of biologists working in bacterial ecology and in HIV/HCV drug resistance. The project integrates a broad range of cutting-edge mathematics, probability and statistics with computational advances that cater to the realities of data collection and analysis in phylogenetics, metatranscriptomics and metagenomics. Advances in the study of evolution, whether in microbial ecosystems (the human gut, sewage treatment plants) or in viruses (HCV/HIV in a human host), would have repercussions on overall health practices at both the individual and epidemiological levels. Quantitative estimates of confidence in 'entero-types' or other inferred clusters would be important in the cost analysis of personalized medicine. Students and postdoctoral fellows will be trained in both biology and statistics, so that they can understand biologists' requests and constraints. Consulting workshops will be organized regularly to discuss the effectiveness of planned experiments and applied statistics. During the academic year, classes targeted at molecular biologists and microbiologists will teach multivariate visualization and geometrical statistics methods using R. The class materials will be open source and available from the class web pages. The PIs will offer several summer schools in microbiology and metagenomics, where they will teach multivariate statistics, phylogenetic analysis, metagenomic analysis and metatranscriptomics, as well as experimental techniques for studying evolution in action.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
1162538
Program Officer
Nandini Kannan
Project Start
Project End
Budget Start
2012-07-15
Budget End
2015-06-30
Support Year
Fiscal Year
2011
Total Cost
$300,021
Indirect Cost
Name
Stanford University
Department
Type
DUNS #
City
Stanford
State
CA
Country
United States
Zip Code
94305