Growing access to large databases across many sectors of the global economy and across scientific disciplines poses significant challenges that demand novel statistical methodologies for data analytics. A common feature of these databases is the large number of subjects under study and the large number of variables describing a wide variety of attributes of each subject. Guided by the dual principles of parsimony and accuracy, statistical models can be developed to serve as informative approximations to the unknown physical reality underlying the data-generating process. However, limited understanding of the nature and pattern of complex nonlinear interactions among the variables in a system has hindered progress in high-dimensional data analysis. In this project, novel statistical theory, methods, and software will be developed to address key foundational issues and open up new avenues for research.

Regression is one of the most fundamental concepts in statistics, with applications in almost all disciplines. Growing interest in high-dimensional regression, whether from the forward or the inverse perspective, has brought out many facets that increase its flexibility in modeling complex data with proven statistical efficiency. In the data-intensive area of genomics, a novel statistical notion, liquid association (LA), has been developed that allows researchers to study dynamic patterns of co-regulation between genes. This project will investigate methods of employing LA for variable selection in high-dimensional regression. In combination with inverse modeling techniques, including sliced inverse regression (SIR), hidden patterns of interaction between clusters of nonlinearly correlated variables can be revealed. In many applications, it is necessary to incorporate a set of background variables that are known to be important in the model. This adds another layer of complexity to revealing patterns in the interlocked variable-to-variable interactions. Methods of marginalizing the interference caused by background variables will also be developed.
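
As a rough illustration of the LA notion mentioned above, the sketch below estimates liquid association in the form it usually takes in the literature: LA(X, Y | Z) = E[g'(Z)] with g(z) = E[XY | Z = z], which for standardized X and Y and standard normal Z reduces to E[XYZ] by Stein's lemma. This is not the project's method or software. The function names, the rank-based normal-score transform, and the simulated data are all assumptions made for the example.

```python
import random
from statistics import NormalDist, fmean, pstdev


def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    m = fmean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def normal_scores(values):
    """Rank-based normal-score transform (assumes no ties in the data)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    nd = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = nd.inv_cdf(rank / (n + 1))
    return scores


def liquid_association(x, y, z):
    """Moment estimate of E[XYZ] after standardizing x, y and normal-scoring z."""
    xs, ys, zs = standardize(x), standardize(y), normal_scores(z)
    return fmean([a * b * c for a, b, c in zip(xs, ys, zs)])


def simulate(n, seed=0):
    """Hypothetical data: the x-y correlation flips sign as z goes from low to high."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [zi * xi + rng.gauss(0, 0.5) for zi, xi in zip(z, x)]
    return x, y, z


if __name__ == "__main__":
    x, y, z = simulate(5000)
    print("LA estimate:", round(liquid_association(x, y, z), 3))
```

In the simulated data, x and y are nearly uncorrelated overall, yet their relationship changes with z. An ordinary correlation would miss this kind of dynamic pattern, while an LA score clearly different from zero flags it.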

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1513622
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2015-09-01
Budget End: 2020-03-31
Support Year:
Fiscal Year: 2015
Total Cost: $300,000
Indirect Cost:
Name: University of California Los Angeles
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095