Predictive modeling is the cornerstone of individualized health care. The outcome of interest is most often the presence or absence of a health condition, and a large number of predictors are commonly available for model building. Both high-dimensional data and missing data pose great challenges for statistical inference in predictive modeling. The overarching goal of this proposal is to address the methodological challenges of predicting binary outcomes with high-dimensional incomplete data. Specifically, the PIs propose to address these challenges from two perspectives: (1) quantify the uncertainty of risk predictions based on the high-dimensional logistic model; and (2) accommodate two study designs in which missingness occurs in a structured way, namely the "positive-only" study design and the two-phase design.
Recent years have seen great breakthroughs in statistical inference methods for high-dimensional data arising from a wide spectrum of scientific fields, with a focus primarily on a single regression coefficient in generalized linear models. Inferential methods for confidence interval construction and hypothesis testing for the predicted probability, which is a function of all regression coefficients, are largely lacking. In this proposal we develop innovative statistical methods to fill this methodological gap in high-dimensional data analysis. The proposed methods are also innovative because they accommodate structured incomplete data arising from important sampling designs. To the best of our knowledge, statistical inference methods for high-dimensional data analysis have to date focused exclusively on complete data arising from cross-sectional study designs. We additionally consider two important study designs with incomplete data: the "positive-only" study design, which arises in electronic health record (EHR) phenotyping, and the two-phase design, a cost-effective sampling design that aims to reduce the cost of measuring expensive predictors. We elucidate the methodological challenges that these missing-data structures pose for downstream analysis and provide corresponding solutions.
Our work is expected to lead to methodological advancements in high-dimensional risk modeling with incomplete data, which is instrumental to individualized health care. Accompanied by user-friendly software, it will offer researchers state-of-the-art statistical tools for risk prediction with high-dimensional predictors. Applications of these methods by researchers in biostatistics and epidemiology are expected to help answer important scientific questions through predictive modeling beyond EHR data analysis, and ultimately to contribute to improved health care.