With recent advances in biotechnology such as genome-wide microarrays and high-throughput sequencing, regression-based modeling of high-dimensional data in the biological sciences has never been more important. The investigator aims to develop a regularized dimension reduction method for very high-dimensional linear regression problems. The main thrust of the research is based on a well-established dimension reduction technique, Partial Least Squares (PLS) regression, which has been heavily used in several scientific research areas where ill-posed problems commonly arise. The proposed work 1) theoretically investigates the suitability of PLS for very high-dimensional regression settings where the number of predictors far exceeds the available sample size; 2) proposes a regularization scheme that promotes variable selection in addition to dimension reduction, constructs rigorous mathematical formulations of the scheme, and characterizes their analytical solutions; and 3) develops an efficient algorithm implementing the proposed framework. Extensions to the related classification and censored data settings are also considered.

The proposed work, when completed and disseminated, will provide a powerful framework for simultaneous dimension reduction and variable selection, relevant to all fields of scientific research that involve high-dimensional ill-posed regression problems. This will allow scientists to analyze high-dimensional data with efficient dimension reduction and increased interpretability. The PI is actively involved in collaborations with biologists, biochemists, geneticists, and medical doctors. The research emanating from this proposal will therefore have a strong interdisciplinary flavor and will be implemented, tested, and tuned to address many real scientific questions of interest. The PI will apply the proposed research to problems arising in the study of variation in gene expression, transcription regulation, and the binding properties of DNA-binding proteins, where selecting the relevant variables is as important as achieving excellent predictive power. The project will integrate research and education by working closely with both graduate and undergraduate students.

Project Report

With recent advances in biotechnology such as genome-wide microarrays and high-throughput sequencing, regression-based modeling of high-dimensional data in biology and related fields has never been more important. This project developed and studied a regularized dimension reduction method for very high-dimensional linear regression problems through three specific aims. The main thrust of the project is based on a well-established dimension reduction technique, Partial Least Squares (PLS). PLS is heavily used in several scientific research areas where ill-posed problems commonly arise. The first aim establishes theoretical properties of PLS for very high-dimensional regression settings where the number of predictors far exceeds the available sample size. The second aim develops a regularization scheme that promotes variable selection in addition to dimension reduction. Rigorous mathematical formulations of the regularization scheme are constructed and their analytical solutions are characterized. The proposed framework, sparse PLS, is implemented by an efficient algorithm. The third aim extends sparse PLS to classification settings. The resulting methodology is implemented as a freely available software package in the statistical programming language R and disseminated through the World Wide Web. This work provides a powerful framework for simultaneous dimension reduction and variable selection, relevant to all fields of scientific research that involve high-dimensional ill-posed regression problems. It allows scientists to analyze high-dimensional data with efficient dimension reduction and increased interpretability. The research emanating from this project has a strong interdisciplinary flavor. Its applications in the biological sciences involve joint analysis of gene expression and genome-wide binding data, and expression quantitative trait loci (eQTL) mapping.
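
To illustrate how a single direction can combine dimension reduction with variable selection, the following minimal sketch computes one sparse PLS-style direction for a univariate response. It assumes centered data, and it uses soft-thresholding of the covariance vector as the sparsity device; that is one common way to obtain exact zeros, and is not necessarily the formulation or algorithm developed in this project. The function names and the tuning parameter `eta` are hypothetical, chosen for illustration only.

```python
import math


def soft_threshold(v, lam):
    """Shrink each entry of v toward zero by lam; entries within lam of zero become 0."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in v]


def sparse_pls_direction(X, y, eta=0.5):
    """Return one unit-length sparse direction for centered X (n x p) and centered y.

    eta in [0, 1) is an illustrative tuning parameter (an assumption, not from
    the source): the threshold is eta times the largest absolute entry of X^T y.
    With eta = 0 this reduces to the ordinary first PLS direction.
    """
    n, p = len(X), len(X[0])
    # The ordinary first PLS direction is proportional to X^T y.
    z = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    lam = eta * max(abs(c) for c in z)
    # Thresholding sets weakly related predictors to exactly zero.
    w = soft_threshold(z, lam)
    norm = math.sqrt(sum(c * c for c in w))
    return [c / norm for c in w] if norm > 0 else w
```

Predictors whose direction weight is exactly zero drop out of the resulting latent component, which is what makes the component easier to interpret than an ordinary PLS component that loads on every predictor.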

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0804597
Program Officer: Gabor J. Szekely
Budget Start: 2008-07-01
Budget End: 2011-06-30
Fiscal Year: 2008
Total Cost: $100,001
Name: University of Wisconsin Madison
City: Madison
State: WI
Country: United States
Zip Code: 53715