In the era of data science, statistical inference is the cornerstone of extracting useful information from complex data sets. Despite significant progress in statistics, many challenges remain in quantifying uncertainty for complex and high-dimensional data. For instance, inherently discrete parameters and model structures are routinely encountered in data science and machine learning problems. For these intrinsically discrete problems, conventional statistical inference approaches do not apply. This project aims to develop a new inferential framework that addresses statistical inference questions for these difficult problems in the analysis of high-dimensional data and rare events data. The framework is expected to be transformative: it will greatly expand the reach of statistical inference and uncertainty quantification and improve how inference is approached in many data science problems. The PIs will actively use the project to recruit and train students, especially students from underrepresented groups, and will integrate the research output into teaching by developing topics courses for senior undergraduate and graduate students at their home university. The results will be disseminated through journal publications and conferences so that they reach different research communities. R packages for the proposed methods will also be released to the public. The graduate student support will be used for interdisciplinary research and for writing code.
Inherently discrete parameters and structures are prevalent in data science. Examples include model indices in model selection problems, the number of clusters and cluster membership in classification, the number of layers and the architecture of deep neural network models, and connectivity, membership, and structure questions in network data. Making inference for discrete parameters and structures is known to be a difficult task. A major challenge is that the large-sample central limit theorem (CLT) no longer applies, and a Bayesian analysis is highly sensitive to the choice of prior on the discrete model structure. This research project aims to develop a novel and general artificial-sample-based inferential framework, termed repro sampling. The idea of repro sampling is to create artificial samples by mimicking the sampling mechanism of the observed data and to study their performance; the artificial samples are then used to help quantify the uncertainty in the estimation of the model and its parameters. Repro sampling is designed to guarantee the coverage property in finite samples and can also be extended to the large-sample setting. The proposed approaches are expected to be broadly applicable, efficient, and computationally feasible. The main research goal is to fully develop the novel inferential framework of repro sampling. Three specific topics tailored to important and difficult inferential problems in data science will also be investigated: (A) model selection and inference in high-dimensional regression, nonparametric, and deep learning models; (B) predictive inference for high-dimensional regression and data science; (C) finite-sample inference and fusion learning for rare events data. The research will significantly advance statistical methodology for important yet challenging inference problems involving discrete parameters, and will broaden the applicability of uncertainty quantification to advanced machine learning methods.
In addition, the research projects involve real databases and are ideally suited for engaging and training students and new researchers.
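As a rough illustration of the artificial-sample idea described above, the following is a minimal toy sketch and not the project's actual method or software. It assumes a hypothetical model y_i = theta + u_i with u_i drawn from a standard normal distribution and theta restricted to integers. The function name `repro_confidence_set` and all of its parameters are illustrative assumptions. For each candidate value, artificial samples are generated by the same mechanism, and the candidate is kept if the observed data look typical among them.

```python
import numpy as np

def repro_confidence_set(y_obs, candidates, alpha=0.05, n_artificial=2000, seed=0):
    """Toy confidence set for an integer location parameter (illustrative only).

    Assumed model: y_i = theta + u_i, with u_i ~ N(0, 1) and theta an integer.
    """
    rng = np.random.default_rng(seed)
    y_obs = np.asarray(y_obs, dtype=float)
    n = y_obs.size
    kept = []
    for theta in candidates:
        # Artificial samples that mimic the assumed sampling mechanism.
        u_star = rng.standard_normal((n_artificial, n))
        y_star = theta + u_star
        # Distance of each artificial sample mean from the candidate value.
        stat_star = np.abs(y_star.mean(axis=1) - theta)
        # Monte Carlo cutoff: (1 - alpha) quantile of the artificial statistics.
        cutoff = np.quantile(stat_star, 1 - alpha)
        # Keep the candidate if the observed data are not extreme under it.
        if abs(y_obs.mean() - theta) <= cutoff:
            kept.append(theta)
    return kept

if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    y = 3 + data_rng.standard_normal(25)  # simulated data, true theta = 3
    print(repro_confidence_set(y, candidates=range(0, 7)))
```

The returned set can contain one value, several values, or none, which reflects how uncertainty about a discrete parameter is expressed as a set of plausible values instead of an interval based on a normal approximation.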
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.