This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
Correlated data arise in diverse fields of the sciences and humanities, ranging from computational biology and geology to health and social studies. However, modern correlated data frequently present additional complications such as high dimensionality, nonlinearity and non-Gaussianity, for which more complex models are needed. Often, these models require a large number of parameters, and the number of parameters can increase with the sample size and can even exceed it. These features pose significant challenges on both theoretical and computational fronts, because traditional likelihood techniques are difficult to apply directly. This proposal aims to develop innovative statistical procedures and efficient computing algorithms for analyzing correlated data with complicated features using mixed models. The investigator focuses on three classes of mixed models: linear mixed models, nonlinear mixed models and generalized linear mixed models, and extends the concept of partial consistency and the nonconcave penalized least squares method to address several challenging issues in mixed model estimation and selection. In particular, the investigator 1) explores the concept of partial consistency in linear mixed models and develops a simple yet robust two-step estimation method; 2) develops penalized least squares methods to select fixed effects as well as the covariance and precision matrices of random effects; 3) formulates the nonlinear mixed model estimation and testing problems as model selection problems and develops a group selection method for nonlinear mixed models; 4) extends the proposed penalized least squares method to ultra-high dimensional variable selection; and 5) generalizes the proposed two-step estimation method and penalized least squares method to generalized linear mixed models.
The research findings of this proposal will greatly broaden the applications of mixed models, especially in jointly modeling different types of data. For example, one can jointly model clinical and genomics data in a unified way, where genomics data such as gene expressions are treated as random effects that explain the heterogeneity among groups, while clinical data such as age, gender and blood pressure are treated as fixed effects and are of primary interest. Such a joint modeling approach allows one to account for the correlation among genes and to study genes or genetic pathways in a systematic way rather than in the traditional gene-by-gene way. Moreover, the research findings of this proposal will also shed light on the analysis of high-dimensional and massive data. For example, by extending the concept of partial consistency to mixed models, one will gain a better understanding of how to exploit the unique structure of high-dimensional data in order to extract valuable information and produce consistent, efficient and robust estimates for some parameters. In addition, the proposed methodologies will be introduced to researchers in other areas through interdisciplinary collaborative work, and will also be integrated into the investigator's educational activities by developing graduate and undergraduate curricula and by training graduate students. Open-source R and MATLAB code implementing the proposed methodologies will be made available to the general public.