The rapid development of technology has led to tremendous growth of large-scale heterogeneous data in science, economics, engineering, healthcare, and many other disciplines. For example, in a modern health information system, electronic health records routinely collect a large amount of information on many patients from heterogeneous populations across different disease categories. Such data provide unique opportunities to understand the association between features and outcomes across different subpopulations. Existing approaches have not fully addressed the formidable computational and statistical challenges that such data pose. To tap the true potential of information-rich data, this project will develop a new computational and statistical paradigm and a solid theoretical foundation for analyzing large-scale heterogeneous data. In addition, the project will provide research training opportunities for graduate students.
The project will build a unified, quantile-modeling-based framework with the overarching goal of achieving effectiveness and reliability in analyzing heterogeneous data, especially when both the number of potential explanatory variables and the sample size are large. The specific goals are (1) to develop resampling-based inference for large-scale heterogeneous data; (2) to develop Bayesian algorithms and a scalable, interpretable, structure-aware approach for better inference; (3) to develop quantile-optimal decision-rule estimation and inference with many covariates; and (4) to develop novel estimation and inference procedures for large-scale quantile regression under censoring. The project will address key barriers in scalability to data size and dimensionality, the exploration of heterogeneity and structure, the need for robustness, and the ability to make use of incomplete observations.
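To fix ideas, quantile modeling estimates a chosen quantile of an outcome rather than its mean, which makes it robust to outliers and able to capture heterogeneity across subpopulations. The sketch below is a minimal illustration of the underlying principle, not the project's proposed methodology: the tau-th sample quantile is the constant that minimizes the average check (pinball) loss. All function names here are hypothetical, and only NumPy is assumed.

```python
import numpy as np

def pinball_loss(u, tau):
    # Check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0}).
    # Asymmetric weighting makes its minimizer a quantile, not a mean.
    return u * (tau - (u < 0))

def fit_constant_quantile(y, tau):
    # Minimize the average pinball loss over candidate constants.
    # Searching over the observed data points suffices, since the
    # minimizer is always attained at a sample point.
    candidates = np.sort(y)
    losses = [pinball_loss(y - c, tau).mean() for c in candidates]
    return candidates[int(np.argmin(losses))]

rng = np.random.default_rng(0)
y = rng.normal(size=1000)
q_hat = fit_constant_quantile(y, 0.9)  # close to the 0.9 sample quantile
```

Replacing the constant with a linear predictor x'beta under the same loss yields linear quantile regression; the heterogeneity across quantile levels tau is what a quantile-based framework exploits.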
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.