Many of today's most challenging problems in Big Data involve multiple testing, including microarray and other bioinformatic analyses, syndromic surveillance, high-throughput screening, and many others. The challenge when simultaneously conducting thousands or millions of tests is to develop testing methodology that detects true signals while preventing false discoveries. Crucial contexts for this research include subgroup analysis (searching for a treatment effect in subgroups of the entire population) and multiple endpoint analysis (simultaneously looking for different treatment effects), arising in interfaces with the pharmaceutical industry and with a partial focus on personalized medicine. The development of computer models is crucial to understanding complex processes, as is understanding how computer models interface with data and uncertainty, a field commonly called Uncertainty Quantification. The immediate scientific applications of the research on Uncertainty Quantification will be the prediction of geophysical hazard probabilities and models of wind fields. Two of the most significant reasons for the recent concern over the reproducibility of science are the failure to control for multiplicities and the common misinterpretation of p-values. In addition to the multiplicity control mentioned above, the project will investigate the possibility of converting p-values to more interpretable quantities, such as the odds of the null hypothesis to the alternative (a toy illustration of such a conversion follows this paragraph).
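As a minimal sketch of this kind of conversion (illustrative only, not the project's eventual methodology), the well-known calibration that bounds the odds of the null to the alternative below by -e p log(p), for p < 1/e and equal prior odds, can be applied directly to a reported p-value. The function name and printed examples below are purely for illustration.

# Sketch: convert a p-value into a lower bound on the odds of the null to the
# alternative hypothesis using the calibration -e * p * log(p), valid for p < 1/e.
import math

def pvalue_to_null_odds_bound(p):
    """Lower bound on the odds (Bayes factor) of H0 to H1 for a p-value p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("calibration applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        odds = pvalue_to_null_odds_bound(p)
        prob_null = odds / (1.0 + odds)  # posterior probability of H0 under equal prior odds
        print(f"p = {p:.3f}: odds(H0:H1) >= {odds:.3f}, P(H0 | data) >= {prob_null:.2f}")

For example, a p-value of 0.05 corresponds to odds of the null to the alternative of at least about 0.41, that is, a posterior probability of the null of at least about 0.29 under equal prior odds, which is one way such conversions make p-values more interpretable.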

The Bayesian approach to multiple testing has the attraction that it is optimally powered for detection, even in the face of highly dependent data or test statistics, while exerting strong control to prevent false discoveries. The barriers to its implementation lie in developing the appropriate prior probability structures and in carrying out the computation. While numerous aspects of Uncertainty Quantification will be investigated, the project will focus in particular on the crucially needed development of emulators (approximations) to complex computer models that output massive space-time data fields (a toy illustration of emulation follows this paragraph). Finding situations in which optimal Bayesian and optimal frequentist procedures agree has major benefits, both foundational and practical, and such new agreements typically arise through the development of new conditional frequentist procedures. This will be done in the context of two methodologies: the study of the odds of correct to false discovery, and multiple endpoint testing. It will also be extended to more general model uncertainty problems, using robust Bayesian analysis.
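As a rough illustration of what an emulator does (a minimal one-dimensional sketch under simplifying assumptions, not the space-time emulators the project will develop), the following fits a Gaussian-process approximation to a handful of runs of a stand-in simulator and then predicts, with uncertainty, at inputs where the simulator was never run. The toy function expensive_model and all kernel settings are hypothetical.

# Minimal Gaussian-process emulator sketch: approximate an expensive computer model
# from a few runs and predict (with uncertainty) at untried inputs.
import numpy as np

def expensive_model(x):
    # Stand-in for a costly simulator; in practice each evaluation could take hours.
    return np.sin(3.0 * x) + 0.5 * x

def rbf_kernel(a, b, length_scale=0.3, variance=1.0):
    # Squared-exponential covariance between 1-D input arrays a and b.
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

# Design points: the few inputs at which the simulator is actually run.
X_train = np.linspace(0.0, 2.0, 8)
y_train = expensive_model(X_train)

# Condition the Gaussian process on the runs (small nugget for numerical stability).
K = rbf_kernel(X_train, X_train) + 1e-8 * np.eye(len(X_train))
K_inv = np.linalg.inv(K)

# Emulator prediction at new inputs: posterior mean and standard deviation.
X_new = np.linspace(0.0, 2.0, 101)
K_s = rbf_kernel(X_new, X_train)
mean = K_s @ K_inv @ y_train
var = rbf_kernel(X_new, X_new).diagonal() - np.einsum("ij,jk,ik->i", K_s, K_inv, K_s)
std = np.sqrt(np.maximum(var, 0.0))

print("max emulator error on this toy example:", np.max(np.abs(mean - expensive_model(X_new))))

The emulator's posterior standard deviation shrinks near the design points and grows away from them, which is the basic mechanism by which emulation quantifies the uncertainty introduced by approximating the computer model.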

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1407775
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2014-07-15
Budget End: 2019-06-30
Support Year:
Fiscal Year: 2014
Total Cost: $599,996
Indirect Cost:
Name: Duke University
Department:
Type:
DUNS #:
City: Durham
State: NC
Country: United States
Zip Code: 27705