Researchers increasingly face challenges posed by high-dimensional data, which record many different characteristics for each subject. For example, biomedical scientists analyze tens of thousands of genomes to determine the cause of disease and identify the most promising treatments, and social scientists study differential policy impacts by leveraging vast amounts of personal data gathered from social media. This project lays out two lines of research aimed at developing valid statistical inference in this challenging environment. The insights and tools developed in this project are expected to contribute to the advancement of a wide variety of disciplines, including economics, education, health care, and biomedical sciences. In addition, graduate students will be engaged in the project to study the statistical guarantees and to develop relevant software packages for the research. Because the project develops modern statistical methodologies with substantial applications, it is well suited to training graduate students in a broad range of skills.

This project addresses two research problems and seeks to provide insights, theory, and tools for more informed decision making in high dimensions. The first problem concerns treatment effect heterogeneity. Quantifying and characterizing heterogeneous treatment effects plays an increasingly important role in evaluating the efficacy of social programs and medical treatments in the presence of high-dimensional covariates. In particular, the research will develop efficient procedures for estimating heterogeneous quantile treatment effects and subgroup average treatment effects via covariate balancing. The second problem concerns the validity and invalidity of data splitting for conducting inference with high-dimensional data. The project will address the issue of "random-splitting bias" in estimating regression coefficients when the number of dummy/imbalanced variables is sizable. To overcome this challenge, the project will develop a guided data splitting framework that splits the data into more balanced halves. Because random data splitting is used well beyond post-selection inference, the framework and the solutions derived from it are broadly applicable to many data-driven investigations.
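
To make the data-splitting concern concrete, the following is a minimal sketch, not the project's guided data splitting framework, whose details the abstract does not give. It assumes a single rare dummy variable and uses simple stratification as a stand-in for "more balanced halves". The function names `random_split` and `balanced_split` and the toy data are hypothetical.

```python
import random

def random_split(n, rng):
    """Plain random split of indices 0..n-1 into two halves."""
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:]

def balanced_split(dummy, rng):
    """Stratified split: each half receives about half of the 1s and half of the 0s.

    This is an illustrative assumption, not the method proposed in the project.
    """
    ones = [i for i, d in enumerate(dummy) if d == 1]
    zeros = [i for i, d in enumerate(dummy) if d == 0]
    rng.shuffle(ones)
    rng.shuffle(zeros)
    first = ones[: len(ones) // 2] + zeros[: len(zeros) // 2]
    second = ones[len(ones) // 2 :] + zeros[len(zeros) // 2 :]
    return first, second

rng = random.Random(0)
n = 200
# Toy imbalanced dummy variable: only 6 of 200 subjects have the value 1.
dummy = [1 if i < 6 else 0 for i in range(n)]

a, b = random_split(n, rng)
print("random split, ones per half:",
      sum(dummy[i] for i in a), sum(dummy[i] for i in b))

c, d = balanced_split(dummy, rng)
print("balanced split, ones per half:",
      sum(dummy[i] for i in c), sum(dummy[i] for i in d))
```

With a plain random split, the number of subjects with the rare dummy value can differ noticeably between the two halves, and one half may contain very few of them. The stratified split places exactly three in each half by construction. With many such variables, balancing all of them at once is harder, which is the setting the project targets.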

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 2015325
Program Officer: Huixia Wang
Project Start:
Project End:
Budget Start: 2020-07-01
Budget End: 2023-06-30
Support Year:
Fiscal Year: 2020
Total Cost: $220,000
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94710