Large-scale multiple testing is an important and rapidly growing area in modern Statistics. The proposed research focuses on new theories, methodologies and computational algorithms to address the fundamental questions and new challenges in this field. The investigator develops new concepts, data-driven schemes and solid theories that promise to improve the statistical efficiency and lay the foundation for simultaneous inferences in large-scale studies, especially when heterogeneity, dependence and other complex structures are present. The major components of the proposed research include: (i) the concept of simultaneously incorporating statistical significance and effect size in multiple testing and a new approach to identifying large non-null effects in heteroscedastic models; (ii) the strategy of exploiting spatial dependency and a new approach to testing correlated hypotheses in a hidden Markov random field; (iii) the strategy of grouping hypotheses in sets and a new approach to testing the significance of multiple groups of important variables; and (iv) the concepts of discovery boundary and effective screening, and a data-driven approach to reducing dimensionality by constructing subsets that are optimal in size and adaptive to unknown sparsity.

The proposed research has significant impact on many scientific applications such as genome-wide association studies, time-course microarray experiments, disease mapping in environmental studies, climate modeling, and medical imaging studies. The multiple testing and screening methods outlined in the proposal will improve the quality of simultaneous decision-making in complicated situations, yield more interpretable and reproducible scientific results, lead to great savings in costs in large-scale investigations, and hence help achieve the ultimate goal of understanding the underlying mechanisms in complex systems or human diseases in a precise, fast and cost-effective way. User-friendly software will be developed and made freely available for public use. Research results will be disseminated through publications, seminars and workshops. The investigator is committed to encouraging the participation of under-represented groups in science, and to integrating the proposed research into educational activities through developing new courses, and through mentoring and training students to work on the frontiers in Statistics with important health science applications.

Project Report

Large-scale multiple testing is an important and rapidly growing area in modern Statistics. The proposed research focuses on new theories, methodologies and computational algorithms to address the fundamental questions and new challenges in this field. During the project period, the investigator has developed new concepts, data-driven schemes and solid theories that promise to improve the statistical efficiency and lay the foundation for simultaneous inferences in large-scale studies, especially when heterogeneity, dependence and other complex structures are present. The major components of the proposed research include: (i) the concept of simultaneously incorporating statistical significance and effect size in multiple testing and a new approach to identifying large non-null effects in heteroscedastic models; (ii) the strategy of exploiting spatial dependency and a new approach to testing spatially correlated hypotheses in a hidden Markov random field; (iii) the strategy of grouping hypotheses in sets and a new approach to testing the significance of multiple groups of important variables; and (iv) the concepts of discovery boundary and effective screening, and a data-driven approach to constructing subsets that are optimal in size and adaptive to unknown sparsity. The PI has successfully achieved the goals of the proposed research. The four topics have led to six publications in top statistical journals. The completed research has a significant impact on many scientific applications such as genome-wide association studies, time-course microarray experiments, disease mapping in environmental studies, climate modeling, and medical imaging studies. Several important applications have been carried out.
The multiple testing and screening methods outlined in the proposal will improve the quality of simultaneous decision-making in complicated situations, yield more interpretable and reproducible scientific results, lead to great savings in costs in large-scale investigations, and hence help achieve the ultimate goal of understanding the underlying mechanisms in complex systems or human diseases in a precise, fast and cost-effective way. User-friendly software has been developed and made freely available for public use. Research results have been disseminated through publications, seminars and workshops. The investigator is committed to encouraging the participation of under-represented groups in science, and to integrating the proposed research into educational activities through developing new courses, and through mentoring and training students to work on the frontiers in Statistics with important health science applications.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1244556
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2011-10-31
Budget End: 2014-07-31
Support Year:
Fiscal Year: 2012
Total Cost: $163,344
Indirect Cost:
Name: University of Southern California
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089