Many of today's most challenging problems in Big Data involve multiple testing, including microarray and other bioinformatic analyses, syndromic surveillance, and high-throughput screening, among many others. The challenge when simultaneously conducting thousands or millions of tests is to develop methodology that detects true signals while preventing false discoveries. Crucial contexts for this research include subgroup analysis (searching for a treatment effect in subgroups of the entire population) and multiple endpoint analysis (simultaneously looking for different treatment effects), in collaboration with the pharmaceutical industry and with a partial focus on personalized medicine.

Development of computer models is crucial to understanding complex processes, as is understanding the interface of those models with data and uncertainty, a field often called Uncertainty Quantification. The immediate science applications of the research on Uncertainty Quantification will be the prediction of geophysical hazard probabilities and models of wind fields.

Two of the most significant reasons for the recent concern over the reproducibility of science are the failure to control for multiplicities and the common misinterpretation of p-values. In addition to the multiplicity control described above, the project will investigate converting p-values to more interpretable quantities, such as the odds of the null to the alternative hypothesis.
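One well-known route from a p-value to odds is the -e·p·log(p) calibration of Sellke, Bayarri and Berger, which lower-bounds the Bayes factor (odds of the null to the alternative) implied by a p-value. The sketch below is purely illustrative; the function name is ours, and this particular calibration is one example of the kind of conversion the project will study, not its actual proposal:

```python
import math

def pvalue_to_null_odds_bound(p):
    """Lower bound on the odds of the null to the alternative implied
    by a p-value, via the -e * p * log(p) calibration of Sellke,
    Bayarri and Berger (valid for 0 < p < 1/e)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("calibration applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)
```

Under this bound, p = 0.05 corresponds to null odds of at least roughly 0.41 (about 1 to 2.5), far less decisive than the "1 in 20" reading the p-value often mistakenly receives.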
The Bayesian approach to multiple testing is attractive because it is optimally powered for detection, even in the face of highly dependent data or test statistics, while exerting strong control over false discoveries. The barriers to its implementation lie in developing appropriate prior probability structures and carrying out the computation. While numerous aspects of Uncertainty Quantification will be investigated, the project will focus in particular on the much-needed development of emulators (fast approximations) of complex computer models that output massive space-time data fields. Finding situations in which optimal Bayesian and optimal frequentist procedures agree has major benefits, both foundational and practical; such agreements typically arise through the development of new conditional frequentist procedures. This will be pursued in two methodological contexts: the study of the odds of correct to false discoveries, and multiple endpoint testing. The approach will also be extended to broader model uncertainty problems through robust Bayesian analysis.
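As a minimal illustration of how a Bayesian prior structure weighs signal against noise in testing, the sketch below computes the posterior probability that a single test is null under a simple two-groups mixture model. The distributional choices (standard normal null, scaled normal alternative) and the parameter values are illustrative assumptions, not the project's actual prior structure:

```python
import math

def posterior_null_prob(z, pi0=0.9, tau=2.0):
    """Posterior probability that one test is null in a two-groups
    mixture: z ~ N(0, 1) under the null, z ~ N(0, 1 + tau^2) under
    the alternative; pi0 is the prior probability of the null.
    All choices here are illustrative assumptions."""
    def normal_pdf(x, sd):
        return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    f0 = normal_pdf(z, 1.0)                       # null density at z
    f1 = normal_pdf(z, math.sqrt(1.0 + tau ** 2)) # alternative density at z
    return pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)
```

In a full multiple-testing analysis, treating pi0 itself as unknown with its own prior is what produces the automatic multiplicity adjustment: as more tests look null, pi0 is pulled upward and each individual discovery is held to a stricter standard.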
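Emulators are commonly built as Gaussian-process interpolators of a limited set of expensive simulator runs. The toy one-dimensional sketch below, with a squared-exponential kernel and illustrative parameter values, conveys the idea; real emulators for massive space-time output fields are far more elaborate:

```python
import numpy as np

def gp_emulator(x_train, y_train, x_new, length_scale=1.0, nugget=1e-6):
    """Minimal Gaussian-process emulator: fit to expensive simulator
    runs (x_train, y_train) with a squared-exponential kernel, then
    return the predictive mean at new inputs x_new.
    length_scale and nugget are illustrative choices."""
    def kernel(a, b):
        return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)
    K = kernel(x_train, x_train) + nugget * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)   # weights for the training runs
    return kernel(x_new, x_train) @ alpha
```

Because the Gaussian process interpolates, the emulator reproduces the simulator at the design points and supplies cheap predictions in between, which is what makes hazard-probability calculations over many inputs feasible.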