Substantial progress has been made over the past decade in high-dimensional data analysis with a diverging number of covariates. However, when the goal is to model the relationship between a response variable and a high-dimensional vector of covariates, most existing work applies only when the response variable is continuous, and it often requires stringent conditions such as independence or homogeneity. Many fundamental problems remain unsolved for high-dimensional data with discrete responses, especially when the standard modeling assumptions are not satisfied. This project aims to develop new statistical theory, methodology and algorithms for analyzing high-dimensional correlated or heterogeneous cross-sectional data with binary or count responses. More specifically, the investigator will (1) rigorously study the asymptotic theory, including consistency and asymptotic normality, of the semiparametric procedure of generalized estimating equations in the new diverging p asymptotic framework; (2) investigate variable selection procedures based on generalized estimating equations for high-dimensional longitudinal and spatially correlated data; and (3) investigate the theory and methodology of sparse quantile regression, where the number of parameters may greatly exceed the sample size, for analyzing heterogeneous data with possibly discrete responses.

The prevalence of high-dimensional binary and count data in various scientific fields, such as biomedical and health sciences, economics, social sciences and environmental studies, demands new statistical theory, methodology and software. Many important issues in analyzing high-dimensional binary or count data, especially in the presence of correlation or heterogeneity, have not been systematically studied. Moreover, existing work based on the full likelihood or the independence assumption in the high-dimensional setting cannot be readily applied. This project will make a significant and timely contribution to the general theory and methodology of high-dimensional data analysis in the diverging p framework. Such theory is critical for guiding practical data analysis. Undergraduate and graduate students, especially those from underrepresented groups, will be encouraged to participate in this research project.

Project Report

High-dimensional longitudinal data, which consist of repeated measurements on a large number of covariates, have become increasingly common. In many large-scale health studies, such as the well-known Framingham Heart Study, many variables, including age, smoking status, cholesterol level and blood pressure, were recorded on the participants over the years to describe their health characteristics and lifestyles. In a yeast cell-cycle gene expression data set we analyze, the gene expression measurements were captured at different time points during the cell cycle. The data set contains 297 cell-cycle regulated genes, and the covariates are the binding probabilities for 96 transcription factors. In some other examples, even though the number of variables is not large, the total number of covariates in the statistical model can become considerably large once various interaction effects are included. The PI established rigorous statistical theory for generalized estimating equations, a popular technique for analyzing longitudinal data, in the high-dimensional setting. Furthermore, the PI and her collaborators proposed penalized estimating equations for variable selection when the number of covariates is large. Real-life high-dimensional data often display heterogeneity due to either heteroscedastic variance or other forms of non-location-scale covariate effects. This type of heterogeneity is of scientific importance but tends to be overlooked by existing procedures, which mostly focus on the center of the conditional distribution. The PI and her collaborators investigated quantile regression based procedures for analyzing heterogeneous data. In particular, they established the oracle property of nonconvex penalized quantile regression with ultra-high dimensional covariates. The findings of the PI will provide scientists in many different disciplines with new and flexible tools for analyzing high-dimensional correlated data or cross-sectional data that display heterogeneity.
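
To make the idea of nonconvex penalized quantile regression more concrete, the following minimal sketch evaluates a penalized objective of the kind such procedures minimize: the average quantile check loss plus a nonconvex penalty on each coefficient. The report does not name a specific penalty; the SCAD penalty with the conventional default a = 3.7 is used here purely as an illustrative assumption, and the function names (`check_loss`, `scad_penalty`, `penalized_objective`) are hypothetical. The sketch only evaluates the objective and does not implement an optimization algorithm.

```python
def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def scad_penalty(t, lam, a=3.7):
    """SCAD penalty (assumed here as one example of a nonconvex penalty).

    Linear near zero, quadratic in the middle, constant for large |t|.
    """
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


def penalized_objective(beta, X, y, tau, lam, a=3.7):
    """Average check loss of the residuals plus the summed SCAD penalty."""
    n = len(y)
    loss = 0.0
    for xi, yi in zip(X, y):
        fitted = sum(b * x for b, x in zip(beta, xi))
        loss += check_loss(yi - fitted, tau)
    penalty = sum(scad_penalty(b, lam, a) for b in beta)
    return loss / n + penalty
```

Because the penalty levels off for large coefficients, strong effects are not shrunk further, while small coefficients are pushed toward zero; this flattening is the feature usually associated with oracle-type results for nonconvex penalties.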

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1007603
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2010-07-01
Budget End: 2013-06-30
Support Year:
Fiscal Year: 2010
Total Cost: $176,595
Indirect Cost:
Name: University of Minnesota Twin Cities
Department:
Type:
DUNS #:
City: Minneapolis
State: MN
Country: United States
Zip Code: 55455