This research aims to develop novel methods that use coarsened data to assess the validity of model specifications. The proposed methods are motivated by the finding that statistical inference can be affected differently by different sources of model misspecification interacting with different (coarsened) data generating schemes. By evaluating the changes in inference outcomes as the data are strategically coarsened, the new methods can not only detect violations of model assumptions but also pinpoint the most influential source of model misspecification. This direction of research will lead to significant improvements over existing model diagnostic methods, most of which provide only an overall assessment of goodness of fit or allow testing only one model assumption at a time. A crucial thread running through the investigation is the study of statistical inference based on coarsened data in the presence of model misspecification. The project will advance the understanding of so-called "wrong model analysis" for coarsened data. Because data are rarely collected exactly as planned in practice, coarsened data are ubiquitous, and the understanding gained from the proposed research is therefore broadly valuable.
Statistical modeling is a key step of statistical analysis in nearly all fields of application. A poorly formulated model often results in misleading conclusions. As researchers entertain increasingly complex statistical models to explain random phenomena, the need for more sophisticated diagnostic techniques becomes pressing. The investigator will conduct comprehensive analyses of the effects of model misspecification on statistical inference based on data of different structures. This knowledge will then be used to develop a rich class of informative diagnostic tools that protect data analysts from inappropriate model assumptions and direct model improvement. The idea underlying the proposed methods is original: it advocates sacrificing data information in order to reveal the mechanism that governs random phenomena, an insight unattainable without such a counterintuitive sacrifice. The project will integrate research and education by sharing the rationale and the investigation with graduate students who work with the investigator or take the advanced-topics course recently developed by the investigator.
In this project the PI developed versatile diagnostic methods for assessing the validity of a posited statistical model for an observed data set, where multiple assumptions imposed on the model are often in question. Besides identifying violations of certain assumptions, the proposed methods can also point to the direction of model correction when misspecification is detected. The PI considered a wide range of statistical models, including linear mixed models (LMM), generalized linear mixed models (GLMM), nonlinear mixed models (NLMM), linear and generalized linear models (GLM) with error-prone covariates, and the induced models for group testing data. All of these models are popular choices in a host of applications because of their practically meaningful interpretation, mathematically elegant formulation, and the convenience of implementing analyses based on them in standard statistical software. Although practitioners are well aware that the assumptions imposed on any of these models can be suspect in a given application, and statisticians acknowledge that violation of a parametric assumption typically compromises inference, very few existing diagnostic methods consider the scenario where more than one assumption is violated, and even fewer provide guidance for model correction when an assumption is shown to be violated. The PI filled these two important gaps in this project. The strategy used to tackle these two rarely addressed tasks relies on the creation of so-called coarsened data. More specifically, one strategically creates a new data set from the observed data such that inference based on the induced data differs from that based on the observed data in a particular way only when certain assumptions are violated in the model for the observed data.
The induced data are generated by reducing information in the observed data in some way, hence the name "coarsened data." The essence of this strategy is to use the change in statistical inference as information in the raw data is reduced, a change that occurs only in the presence of model misspecification, to identify the source of misspecification, since different sources of misspecification produce different patterns of change. Furthermore, the direction of the change often relates to the direction in which the assumed model deviates from the true model, so the change offers clues about how to correct the assumed model when evidence suggests that an assumption is inappropriate. Using this strategy, the PI was able to disentangle assumption violations on the random intercept from those on random slopes in mixed-effects models, and to separate misspecification of the error-prone true covariate from a wrong assumption on the link function in GLM with errors in covariates. Additionally, the proposed methods can indicate the direction of skewness of the true random-effects distribution. Similarly, one can infer the skewness of the distribution of the error-prone true covariate as well as the skewness of the link function in an assumed GLM. Finally, the methods in the context of mixed-effects models provide a new way to test variance components. Unlike traditional variance component tests, the new test does not require a normality assumption on the random effect being tested. Besides the practically useful diagnostic tools, the research also produced theoretically valuable discoveries regarding inference based on wrong models and/or coarsened data.
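The general idea above can be illustrated in a deliberately simple toy setting (this is a hypothetical sketch, not the PI's actual procedure; the threshold `c`, the known-variance assumption, and all function names are illustrative). Two estimators of a mean agree when the assumed normal model is correct: one uses the full data, the other recovers the mean from coarsened data (the data dichotomized at a threshold) by inverting the normal tail probability. When the true distribution is skewed, the two estimates diverge, and the sign of the gap reflects the direction of skewness:

```python
# Toy illustration of a coarsening-based diagnostic (hypothetical example):
# compare a mean estimate from the full data with one recovered from
# coarsened (dichotomized) data under an assumed normal model.
import random
from statistics import NormalDist, mean

def mean_from_full(y):
    """MLE of the mean under normality, using the raw data."""
    return mean(y)

def mean_from_coarsened(y, c, sigma):
    """Recover the mean from the coarsened data Z = 1{Y > c}, assuming
    Y ~ N(mu, sigma^2) with sigma known:
    P(Y > c) = 1 - Phi((c - mu)/sigma)  =>  mu = c + sigma * Phi^{-1}(p)."""
    p_hat = sum(1 for v in y if v > c) / len(y)
    return c + sigma * NormalDist().inv_cdf(p_hat)

random.seed(42)
n = 200_000

# Case 1: the normality assumption holds -> the two estimates agree.
y_norm = [random.gauss(1.0, 1.0) for _ in range(n)]
d_norm = mean_from_full(y_norm) - mean_from_coarsened(y_norm, c=1.5, sigma=1.0)

# Case 2: right-skewed truth (exponential with the same mean and sd) ->
# the estimates diverge; the positive gap signals right skewness.
y_skew = [random.expovariate(1.0) for _ in range(n)]
d_skew = mean_from_full(y_skew) - mean_from_coarsened(y_skew, c=1.5, sigma=1.0)

print(f"discrepancy under normality: {d_norm:+.3f}")
print(f"discrepancy under skewness:  {d_skew:+.3f}")
```

The contrast between the two estimates plays the role of the "change in inference under coarsening": near zero when the assumed model is correct, systematically nonzero (with an informative sign) when it is not.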
Specifically, the new diagnostic methods are built upon theoretical investigation along two directions: first, the effects of data coarsening on statistical inference based on a correct model; second, the effects of model misspecification on inference based on the raw data and on inference based on coarsened data. The first direction has practical value because data collected in a real-life study are often unintentionally less than "ideal" because of, say, measurement error, missingness (e.g., patients dropping out of a study), or grouping (e.g., budget or time constraints that make individual testing data unavailable). The second direction is also practically motivated, since one can rarely be sure that a posited model is correct for a given data set. Along these two lines of investigation, the PI made some intriguing discoveries. One of the highlights is that inference based on group testing data can be more robust to data coarsening, such as misclassification in the grouped responses or error contamination in covariates, than inference based on individual testing data. This holds for both the bias and the efficiency of inference, especially when grouping is done randomly and the group sizes are chosen appropriately. Finally, motivated by findings along the second direction, which showed that the adverse effects of model misspecification are usually non-negligible, the PI developed new inferential methods that avoid the adverse effects of certain kinds of model misspecification.
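The robustness of group testing to response misclassification can be seen in a small simulation (a hypothetical sketch under simplifying assumptions, not the PI's analysis; the sensitivity/specificity values, group size, and function names are illustrative). Both designs use a naive prevalence estimator that wrongly assumes a perfect assay; the bias this induces is much larger under individual testing than under random grouping at low prevalence:

```python
# Hypothetical simulation: bias of naive prevalence estimators under an
# imperfect assay, individual testing vs. random group testing.
import random

def simulate(p, se, sp, n, k, rng):
    """Return naive prevalence estimates (individual, grouped)."""
    status = [rng.random() < p for _ in range(n)]  # true infection status

    def test(truth):
        # Imperfect assay: sensitivity se, specificity sp.
        return rng.random() < (se if truth else 1 - sp)

    # Individual testing, naively treating the assay as perfect.
    p_ind = sum(test(s) for s in status) / n

    # Random groups of size k; a group is truly positive iff any member is.
    rng.shuffle(status)
    groups = [status[i:i + k] for i in range(0, n, k)]
    q_hat = sum(test(any(g)) for g in groups) / len(groups)
    p_grp = 1 - (1 - q_hat) ** (1 / k)  # naive inversion, perfect-assay model
    return p_ind, p_grp

rng = random.Random(7)
p_true = 0.02
p_ind, p_grp = simulate(p_true, se=0.95, sp=0.98, n=200_000, k=5, rng=rng)
print(f"individual-testing bias: {p_ind - p_true:+.4f}")
print(f"group-testing bias:      {p_grp - p_true:+.4f}")
```

Intuitively, at low prevalence the false-positive rate 1 - sp dominates the individual-level bias, while grouping raises the true positive probability per test, diluting the relative impact of misclassification.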